alphaXiv

Memory in the Age of AI Agents

This extensive survey introduces a unified framework for understanding agent memory, categorizing its architectural forms, functional roles, and operational dynamics. It also distinguishes agent memory from related concepts like LLM memory and RAG, providing a structured overview of the rapidly evolving field.

Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics

Error-Free Linear Attention (EFLA) provides an exact, closed-form solution to the continuous-time dynamics underlying linear attention, achieving linear-time complexity by exploiting a rank-1 property of the dynamics matrix. The Nanyang Technological University researchers show that this approach delivers superior numerical stability and robustness to noise and out-of-distribution inputs, along with improved performance on language modeling and commonsense reasoning tasks.
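
The "rank-1" route to an exact solution can be illustrated with a standard identity (a hedged sketch; the precise dynamics matrix EFLA integrates is defined in the paper, not here). If the state evolves as $\frac{dS}{dt} = A\,S$ with a rank-1 dynamics matrix $A = u\,w^{\top}$ and scalar $\lambda = w^{\top}u \neq 0$, then powers collapse to $A^n = \lambda^{n-1}\,u\,w^{\top}$, so the matrix exponential needed for the exact update has a closed form:

$$\exp(A\,\Delta t) \;=\; I \;+\; \frac{e^{\lambda \Delta t} - 1}{\lambda}\,u\,w^{\top}.$$

Applying this operator to the state reduces to a rank-1 update per step, which is how an exact update (rather than a numerical ODE approximation) can stay linear-time in sequence length.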

Native and Compact Structured Latents for 3D Generation

Researchers at Tsinghua University, Microsoft Research, and the University of Science and Technology of China developed a novel "field-free" sparse voxel representation, O-Voxel, that natively encodes both arbitrary 3D geometry and physically-based rendering (PBR) materials. This system, combined with a Sparse Compression VAE and flow-matching generative models, produces high-fidelity, PBR-textured 3D assets from single images with 16x spatial compression and generates 1024³ resolution assets in approximately 17 seconds.
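
As a rough illustration of what a sparse voxel asset with native PBR channels might store (the actual O-Voxel layout is defined in the paper; the channel names below are assumptions for illustration), only occupied cells are kept, each carrying geometry and material attributes:

```python
# Hedged sketch of a sparse, PBR-textured voxel asset: store only occupied voxels,
# each with geometry and material channels. Channel names here are illustrative
# assumptions, not the O-Voxel specification.
import numpy as np

resolution = 1024                                   # a dense 1024^3 grid would hold ~1e9 cells
n_occupied = 50_000                                 # sparse assets touch a tiny fraction of cells
coords = np.random.randint(0, resolution, size=(n_occupied, 3), dtype=np.int32)
features = {
    "surface_offset": np.random.rand(n_occupied, 3).astype(np.float32),  # sub-voxel geometry
    "albedo":         np.random.rand(n_occupied, 3).astype(np.float32),  # PBR base color
    "roughness":      np.random.rand(n_occupied, 1).astype(np.float32),
    "metallic":       np.random.rand(n_occupied, 1).astype(np.float32),
}
occupancy = n_occupied / resolution ** 3
print(f"stored {n_occupied:,} voxels ({occupancy:.6%} of the dense grid)")
```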

QwenLong-L1.5: Post-Training Recipe for Long-Context Reasoning and Memory Management

QwenLong-L1.5, a 30B parameter language model from Alibaba Group's Tongyi Lab, was developed with a post-training recipe to enhance long-context reasoning capabilities. The model achieves an average score of 71.82 across six long-context benchmarks, demonstrating performance comparable to leading proprietary models and extending context processing to 4 million tokens via a memory-augmented architecture.

MMGR: Multi-Modal Generative Reasoning

MMGR presents a multi-modal benchmark to evaluate generative AI's understanding of physical, logical, and spatial reasoning, moving beyond perceptual fidelity. The study reveals state-of-the-art models exhibit widespread reasoning deficits and a significant gap between perceived visual quality and actual world consistency.

Motus: A Unified Latent Action World Model

Motus is a unified latent action world model that integrates five distinct generative capabilities for embodied agents, leveraging pretrained vision-language and video generation models with an optical flow-based latent action representation. The model achieved state-of-the-art results in simulation, with over 45% absolute improvement in multi-task settings on RoboTwin 2.0, and boosted real-world robotic task success rates by up to 48% on the AC-One platform.

Universal Reasoning Model

Researchers at Ubiquant developed the Universal Reasoning Model (URM), an enhanced Universal Transformer, which achieved new state-of-the-art performance on abstract reasoning benchmarks including ARC-AGI 1 (53.8% pass@1), ARC-AGI 2 (16.0% pass@1), and Sudoku (77.6% accuracy) by strengthening recurrent inductive bias and nonlinear components.
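
The recurrent inductive bias that URM strengthens comes from the Universal Transformer's weight sharing across depth; a minimal sketch of that backbone is below (an illustration only, not URM's actual architecture or its nonlinear enhancements):

```python
# Minimal Universal-Transformer-style recurrence (illustrative sketch; URM's specific
# recurrent and nonlinear components are described in the paper, not here).
import torch
import torch.nn as nn

class RecurrentEncoder(nn.Module):
    def __init__(self, d_model=256, nhead=4, steps=8):
        super().__init__()
        # A single shared layer reused at every step: depth comes from recurrence,
        # not from a stack of independently parameterized layers.
        self.shared_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.steps = steps

    def forward(self, x):
        for _ in range(self.steps):        # iterate the same tied weights
            x = self.shared_layer(x)
        return x

tokens = torch.randn(2, 16, 256)           # (batch, sequence, d_model)
print(RecurrentEncoder()(tokens).shape)    # torch.Size([2, 16, 256])
```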

Let's (not) just put things in Context: Test-Time Training for Long-Context LLMs

Researchers from Meta, Harvard University, OpenAI, and other institutions introduced Query-only Test-Time Training (qTTT) to enhance large language models' ability to use information in long contexts. This method dynamically adapts the attention mechanism during inference to counteract 'score dilution', yielding average performance improvements of 12.6% on LongBench-v2 and 14.1% on ZeroScrolls for Qwen3-4B models.
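
The "score dilution" being counteracted is easy to see with a back-of-the-envelope example (a hedged illustration, not the paper's formulation): as the context grows, the softmax attention weight on a relevant token shrinks even though its logit stays fixed and higher than everything else.

```python
# Score dilution, illustrated: one relevant token with logit 5.0 among N distractor
# tokens with logit 0.0. The softmax weight on the relevant token collapses as N grows.
import math

def relevant_weight(n_distractors, relevant_logit=5.0, distractor_logit=0.0):
    num = math.exp(relevant_logit)
    return num / (num + n_distractors * math.exp(distractor_logit))

for n in (10, 1_000, 100_000):
    print(f"{n:>7,} distractors -> weight on relevant token = {relevant_weight(n):.4f}")
# ~0.94 at 10 distractors, ~0.13 at 1,000, ~0.0015 at 100,000
```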

Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition

Qwen-Image-Layered introduces an end-to-end diffusion model that decomposes an RGB image into multiple semantically disentangled RGBA layers, providing inherent editability. The approach achieves improved decomposition quality and reconstruction fidelity compared to prior methods while enabling consistent image manipulation.
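
A layered decomposition is "inherently editable" because individual RGBA layers can be modified and recomposited cheaply; a minimal recompositing sketch using the standard "over" operator follows (the paper's exact compositing convention is an assumption here).

```python
# Recomposite back-to-front RGBA layers into an RGB image with the standard "over"
# operator. Editing one layer and re-running this is the sense in which a layered
# decomposition gives inherent editability; the compositing convention is assumed.
import numpy as np

def composite(layers):
    """layers: list of (H, W, 4) float arrays in [0, 1], ordered back to front."""
    out = np.zeros(layers[0].shape[:2] + (3,))
    for layer in layers:
        rgb, alpha = layer[..., :3], layer[..., 3:4]
        out = alpha * rgb + (1.0 - alpha) * out     # paint this layer over the result
    return out

background = np.ones((64, 64, 4))                   # opaque white layer
sticker = np.zeros((64, 64, 4))
sticker[16:48, 16:48] = [1.0, 0.0, 0.0, 0.5]        # semi-transparent red square layer
print(composite([background, sticker])[32, 32])     # [1.  0.5 0.5]
```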

Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

NVIDIA's Nemotron-Cascade framework enables the development of general-purpose Large Language Models with enhanced reasoning capabilities by employing a cascaded reinforcement learning approach. This method sequentially fine-tunes models across diverse domains, achieving state-of-the-art performance in areas such as competitive programming, math, and software engineering, while supporting both "thinking" and "instruct" modes within a single unified model.

Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

ByteDance Seed's Seedance 1.5 pro is a foundational model for native audio-visual joint generation, achieving superior synchronization and quality by simultaneously synthesizing both modalities. The model demonstrates leading performance in Chinese-language audio generation and cinematic camera control, leveraging a unified multimodal architecture and extensive post-training optimization to surpass contemporary models.

Hindsight is 20/20: Building Agent Memory that Retains, Recalls, and Reflects

HINDSIGHT introduces a novel memory architecture for AI agents, integrating structured memory (TEMPR) with preference-conditioned reasoning (CARA) to enhance long-term recall and behavioral consistency. The system achieves state-of-the-art results on challenging benchmarks, reaching 91.4% accuracy on LongMemEval and 89.61% on LoCoMo, significantly outperforming existing memory systems and full-context baselines, even with open-source LLM backbones.

Olmo 3

The Allen Institute for AI's Olmo 3 project introduces a fully open ecosystem of language models, making available all data, code, and checkpoints for transparency and reproducibility. The initiative offers 7B and 32B parameter models, with the Olmo 3.1 Think 32B variant achieving state-of-the-art performance among fully open thinking models and approaching leading open-weight models despite training on significantly fewer tokens.

In Pursuit of Pixel Supervision for Visual Pre-training

FAIR (Meta) introduces Pixio, an enhanced masked autoencoder for visual pre-training that leverages pixel supervision on 2 billion web images. Pixio achieves competitive or superior performance compared to leading latent-space methods like DINOv2/v3 across various dense visual tasks, including depth estimation, 3D reconstruction, semantic segmentation, and robot learning, by improving MAE's decoder, masking strategy, and class token usage.
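
The core of MAE-style pixel supervision is a reconstruction loss computed only on masked patches; a minimal sketch is shown below (Pixio's actual decoder, masking strategy, and class-token changes are precisely what the paper modifies and are not reproduced here).

```python
# Minimal sketch of masked-autoencoder pixel supervision: mean-squared error between
# predicted and true pixel patches, averaged over masked patches only.
import torch

def masked_pixel_loss(pred_patches, true_patches, mask):
    """pred/true: (B, N, patch_dim) pixel patches; mask: (B, N) with 1 = masked."""
    per_patch = ((pred_patches - true_patches) ** 2).mean(dim=-1)   # (B, N)
    return (per_patch * mask).sum() / mask.sum()                    # average over masked patches

B, N, D = 2, 196, 16 * 16 * 3              # e.g. a 224x224 image as a 14x14 grid of 16x16 RGB patches
true = torch.rand(B, N, D)
pred = torch.rand(B, N, D)                 # stand-in for the decoder's output
mask = (torch.rand(B, N) < 0.75).float()   # ~75% of patches masked, as in MAE
print(masked_pixel_loss(pred, true, mask))
```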

Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space

Researchers developed DMLR, a test-time framework that enables Multimodal Large Language Models to perform dynamic, confidence-guided reasoning by iteratively refining latent "think tokens" and adaptively injecting visual information. This approach improves reasoning and perception performance across diverse tasks, achieving better accuracy-efficiency trade-offs without requiring additional model training.

ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding

ReFusion introduces a diffusion large language model with parallel autoregressive decoding, achieving superior generation quality and 2.33x faster inference compared to strong autoregressive models. This is accomplished by a slot-based "plan-and-infill" decoding mechanism that enables full Key-Value (KV) cache reuse and addresses the coherence issues found in prior masked diffusion models.

DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models

In recent multimodal research, the diffusion paradigm has emerged as a promising alternative to the autoregressive (AR) paradigm, owing to its unique decoding advantages. However, due to the capability limitations of base diffusion language models, the performance of diffusion vision language models (dVLMs) still lags significantly behind that of mainstream models. This leads to a simple yet fundamental question: is it possible to construct dVLMs from existing, powerful AR models? In response, we propose DiffusionVL, a dVLM family that can be translated from any powerful AR model. Through simple fine-tuning, we successfully adapt AR pre-trained models to the diffusion paradigm. This approach yields two key observations: (1) the paradigm shift from AR-based multimodal models to diffusion is remarkably effective; (2) direct conversion of an AR language model into a dVLM is also feasible, achieving performance competitive with LLaVA-style visual instruction tuning. Further, we introduce a block-decoding design for dVLMs that supports arbitrary-length generation and KV-cache reuse, achieving a significant inference speedup. Extensive experiments show that, despite training with less than 5% of the data required by prior methods, DiffusionVL achieves comprehensive performance improvements, including a 34.4% gain on the MMMU-Pro (vision) benchmark and a 37.5% gain on the MME (Cognition) benchmark, alongside a 2x inference speedup. The model and code are released at this https URL.

MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

MemFlow, developed by researchers from HKU, HKUST(GZ), and Kuaishou Technology, introduces a framework featuring Narrative Adaptive Memory and Sparse Memory Activation to generate long, interactive video narratives with improved consistency and efficiency. The system achieves a Quality Score of 85.02 and Consistency Score of 96.60 in multi-prompt 60-second video generation, while operating at 18.7 FPS on an NVIDIA H100 GPU.

Step-GUI Technical Report

A comprehensive framework advances practical Graphical User Interface (GUI) agents by introducing a self-evolving training pipeline that reduces data acquisition costs by 10-100x, a standardized protocol with privacy features, and a real-world evaluation benchmark. The resulting compact 4B and 8B Step-GUI models achieve state-of-the-art performance across diverse GUI automation tasks, demonstrating improved efficiency and privacy protection.
