Trending Papers - Hugging Face

by AK and the research community

Submitted by

taesiri

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

We present MiroThinker v1.0, an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that only scale up model size or context length, MiroThinker explores interaction scaling at the model level, systematically training the model to handle deeper and more frequent agent-environment interactions as a third dimension of performance improvement. Unlike LLM test-time scaling, which operates in isolation and risks degradation with longer reasoning chains, interactive scaling leverages environment feedback and external information acquisition to correct errors and refine trajectories. Through reinforcement learning, the model achieves efficient interaction scaling: with a 256K context window, it can perform up to 600 tool calls per task, enabling sustained multi-turn reasoning and complex real-world research workflows. Across four representative benchmarks (GAIA, HLE, BrowseComp, and BrowseComp-ZH), the 72B variant achieves up to 81.9%, 37.7%, 47.1%, and 55.6% accuracy respectively, surpassing previous open-source agents and approaching commercial counterparts such as GPT-5-high. Our analysis reveals that MiroThinker benefits from interactive scaling consistently: research performance improves predictably as the model engages in deeper and more frequent agent-environment interactions, demonstrating that interaction depth exhibits scaling behaviors analogous to model size and context length. These findings establish interaction scaling as a third critical dimension for building next-generation open research agents, complementing model capacity and context windows.

· Published on Nov 14, 2025
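
Interaction scaling, as described in the abstract, amounts to letting the agent interleave reasoning with a large budget of tool calls and fold each observation back into its context. A minimal sketch of such a loop, assuming a stubbed call_llm and placeholder tools rather than MiroThinker's actual stack:

```python
# Minimal sketch of an interaction-scaling agent loop: the model may issue many
# tool calls per task, and each observation is appended back into the context.
# `call_llm` and the tool set are placeholders, not MiroThinker's implementation.

def call_llm(context: list[dict]) -> dict:
    """Stub for a tool-augmented LLM call; returns a tool request or a final answer."""
    # A real deployment would query a model served with a long (e.g. 256K) context window.
    return {"type": "final", "content": "stub answer"}

TOOLS = {
    "search": lambda query: f"search results for {query!r}",  # placeholder tool
    "fetch": lambda url: f"page contents of {url}",           # placeholder tool
}

def run_agent(task: str, max_tool_calls: int = 600) -> str:
    context = [{"role": "user", "content": task}]
    for _ in range(max_tool_calls):          # interaction budget, e.g. up to 600 calls
        step = call_llm(context)
        if step["type"] == "final":          # model decides it has enough evidence
            return step["content"]
        observation = TOOLS[step["tool"]](step["argument"])
        context.append({"role": "assistant", "content": f"call {step['tool']}"})
        context.append({"role": "tool", "content": observation})  # environment feedback
    return "budget exhausted"

print(run_agent("What changed in the latest BrowseComp release?"))
```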

Submitted by

taesiri

LTX-2: Efficient Joint Audio-Visual Foundation Model

LTX-2 is an open-source audiovisual diffusion model that generates synchronized video and audio content using a dual-stream transformer architecture with cross-modal attention and classifier-free guidance.

· Published on Jan 6, 2026
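
Classifier-free guidance, one of the components named above, blends conditional and unconditional denoiser predictions at sampling time. A generic sketch of that combination (illustrative only, not LTX-2's code):

```python
import numpy as np

def cfg_combine(pred_uncond: np.ndarray, pred_cond: np.ndarray, guidance_scale: float) -> np.ndarray:
    """Standard classifier-free guidance: push the denoiser output toward the
    conditional prediction by `guidance_scale`. Generic formula, not LTX-2 code."""
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)

# Toy example with random "denoiser outputs" for a latent of shape (4, 8, 8).
rng = np.random.default_rng(0)
uncond, cond = rng.normal(size=(4, 8, 8)), rng.normal(size=(4, 8, 8))
guided = cfg_combine(uncond, cond, guidance_scale=5.0)
print(guided.shape)
```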

Submitted by

JiaaqiLiu

SimpleMem: Efficient Lifelong Memory for LLM Agents

To support reliable long-term interaction in complex environments, LLM agents require memory systems that efficiently manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs. To address this challenge, we introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization: (1) Semantic Structured Compression, which applies entropy-aware filtering to distill unstructured interactions into compact, multi-view indexed memory units; (2) Recursive Memory Consolidation, an asynchronous process that integrates related units into higher-level abstract representations to reduce redundancy; and (3) Adaptive Query-Aware Retrieval, which dynamically adjusts retrieval scope based on query complexity to construct precise context efficiently. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost, achieving an average F1 improvement of 26.4% while reducing inference-time token consumption by up to 30-fold, demonstrating a superior balance between performance and efficiency. Code is available at https://github.com/aiming-lab/SimpleMem.

  • 8 authors

· Published on Jan 5, 2026
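
The three stages read naturally as a write path (compress), a background job (consolidate), and a read path (retrieve). A schematic skeleton under that reading, with placeholder heuristics standing in for the paper's entropy-aware filtering and scoring:

```python
from dataclasses import dataclass, field

@dataclass
class MemoryUnit:
    text: str
    views: dict = field(default_factory=dict)  # multi-view index keys (topic, time, ...)

class SimpleMemSketch:
    """Schematic compress / consolidate / retrieve loop; the heuristics below are
    placeholders, not SimpleMem's entropy-aware filtering or scoring."""

    def __init__(self):
        self.units: list[MemoryUnit] = []

    def compress(self, interaction: str) -> None:
        # Stage 1: keep only informative spans (here: a crude length filter).
        salient = [s for s in interaction.split(". ") if len(s.split()) > 4]
        for s in salient:
            self.units.append(MemoryUnit(text=s, views={"topic": s.split()[0].lower()}))

    def consolidate(self) -> None:
        # Stage 2: merge units sharing an index key into one higher-level unit.
        by_topic: dict[str, list[MemoryUnit]] = {}
        for u in self.units:
            by_topic.setdefault(u.views["topic"], []).append(u)
        self.units = [MemoryUnit(text=" / ".join(u.text for u in group), views={"topic": topic})
                      for topic, group in by_topic.items()]

    def retrieve(self, query: str, budget: int = 3) -> list[str]:
        # Stage 3: widen the retrieval scope for more complex queries.
        scope = budget if len(query.split()) < 8 else 2 * budget
        scored = sorted(self.units,
                        key=lambda u: sum(w in u.text.lower() for w in query.lower().split()),
                        reverse=True)
        return [u.text for u in scored[:scope]]

mem = SimpleMemSketch()
mem.compress("The user prefers concise answers. They work on retrieval systems at a lab.")
mem.consolidate()
print(mem.retrieve("what does the user work on"))
```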

Submitted by

hongyuw

BitNet b1.58 2B4T Technical Report

BitNet b1.58 2B4T, a 1-bit Large Language Model with 2 billion parameters, matches the performance of full-precision models while improving computational efficiency.

· Published on Apr 16, 2025
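
The 1.58-bit format refers to ternary weights in {-1, 0, 1}. A small sketch of the absmean-style ternary quantizer described in the BitNet b1.58 line of work (illustrative only; the released model applies this per layer with dedicated kernels):

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-5):
    """Absmean ternary quantization sketch: scale by the mean absolute weight,
    then round and clip to {-1, 0, 1}."""
    scale = np.mean(np.abs(w)) + eps
    w_q = np.clip(np.round(w / scale), -1, 1)
    return w_q.astype(np.int8), scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(4, 8))          # toy weight matrix
w_q, scale = absmean_ternary_quantize(w)
print(w_q)                                        # entries are -1, 0, or 1
print(np.mean(np.abs(w - w_q * scale)))           # mean quantization error
```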

Submitted by

hao-li

Agent READMEs: An Empirical Study of Context Files for Agentic Coding

Agentic coding tools receive goals written in natural language as input, break them down into specific tasks, and write or execute the actual code with minimal human intervention. Central to this process are agent context files ("READMEs for agents") that provide persistent, project-level instructions. In this paper, we conduct the first large-scale empirical study of 2,303 agent context files from 1,925 repositories to characterize their structure, maintenance, and content. We find that these files are not static documentation but complex, difficult-to-read artifacts that evolve like configuration code, maintained through frequent, small additions. Our content analysis of 16 instruction types shows that developers prioritize functional context, such as build and run commands (62.3%), implementation details (69.9%), and architecture (67.7%). We also identify a significant gap: non-functional requirements like security (14.5%) and performance (14.5%) are rarely specified. These findings indicate that while developers use context files to make agents functional, they provide few guardrails to ensure that agent-written code is secure or performant, highlighting the need for improved tooling and practices.

  • 11 authors

· Published on Nov 17, 2025
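
As a rough illustration of this kind of content analysis, one could scan repositories for agent context files and tally keyword-based instruction categories; the file names and keyword buckets below are assumptions, not the paper's taxonomy or labeling procedure:

```python
from pathlib import Path

# Hypothetical context-file names and keyword buckets; the study's actual
# instruction taxonomy and annotation were far richer than this heuristic.
CONTEXT_FILE_NAMES = {"AGENTS.md", "CLAUDE.md", ".cursorrules"}
CATEGORY_KEYWORDS = {
    "build_and_run": ["build", "run", "npm", "make", "pytest"],
    "architecture": ["architecture", "module", "layer", "directory"],
    "security": ["security", "secret", "credential", "vulnerab"],
    "performance": ["performance", "latency", "benchmark", "cache"],
}

def tally_categories(repo_root: str) -> dict[str, int]:
    counts = {cat: 0 for cat in CATEGORY_KEYWORDS}
    for path in Path(repo_root).rglob("*"):
        if path.name not in CONTEXT_FILE_NAMES or not path.is_file():
            continue
        text = path.read_text(errors="ignore").lower()
        for cat, words in CATEGORY_KEYWORDS.items():
            if any(w in text for w in words):
                counts[cat] += 1      # file mentions this instruction type at least once
    return counts

if __name__ == "__main__":
    print(tally_categories("."))
```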

Submitted by

buaahsh

BitNet Distillation

BitNet Distillation fine-tunes large language models to 1.58-bit precision using SubLN, multi-head attention distillation, and continual pre-training, achieving comparable performance with significant memory and inference speed improvements.
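
Attention and logit distillation of the kind named above can be pictured as a weighted sum of two KL terms between teacher and student distributions. A generic sketch, whose weighting and targets may differ from the paper's exact losses:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl(p, q, eps=1e-9):
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1).mean()

def distill_loss(student_logits, teacher_logits, student_attn, teacher_attn,
                 alpha=0.5, temperature=2.0):
    """Generic sketch combining logit distillation with attention-map distillation."""
    p_t = softmax(teacher_logits / temperature)
    p_s = softmax(student_logits / temperature)
    logit_term = kl(p_t, p_s) * temperature ** 2
    attn_term = kl(teacher_attn, student_attn)   # attention rows are already distributions
    return alpha * logit_term + (1 - alpha) * attn_term

rng = np.random.default_rng(0)
V, H, T = 32, 4, 6                               # vocab size, heads, sequence length
s_logits, t_logits = rng.normal(size=(T, V)), rng.normal(size=(T, V))
s_attn, t_attn = softmax(rng.normal(size=(H, T, T))), softmax(rng.normal(size=(H, T, T)))
print(distill_loss(s_logits, t_logits, s_attn, t_attn))
```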

Submitted by

unilm

VibeVoice Technical Report

VibeVoice synthesizes long-form multi-speaker speech using next-token diffusion and a highly efficient continuous speech tokenizer, achieving superior performance and fidelity.

Submitted by

rajkumarrawal

Recursive Language Models

We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference strategy that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs successfully handle inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of base LLMs and common long-context scaffolds across four diverse long-context tasks, while having comparable (or cheaper) cost per query.
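
The recursive strategy can be caricatured as: if the prompt exceeds the window, split it, answer the query over each snippet recursively, then answer over the merged partial answers. A schematic sketch with a stubbed ask_model; the paper's actual control flow is agentic and richer than this fixed recursion:

```python
CONTEXT_LIMIT = 2_000          # characters, a stand-in for a token budget

def ask_model(query: str, context: str) -> str:
    """Placeholder for a single bounded-context LLM call."""
    return f"[answer to {query!r} using {len(context)} chars of context]"

def recursive_answer(query: str, prompt: str) -> str:
    if len(prompt) <= CONTEXT_LIMIT:
        return ask_model(query, prompt)
    mid = len(prompt) // 2
    partials = [recursive_answer(query, prompt[:mid]),
                recursive_answer(query, prompt[mid:])]
    return ask_model(query, "\n".join(partials))   # recurse over the partial answers

long_prompt = "lorem ipsum " * 5_000               # far beyond the stand-in window
print(recursive_answer("where is the needle?", long_prompt))
```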

Submitted by

taesiri

HunyuanVideo 1.5 Technical Report

HunyuanVideo 1.5 is a lightweight video generation model with state-of-the-art visual quality and motion coherence, using a DiT architecture with SSTA and an efficient video super-resolution network.

· Published on Nov 24, 2025

Self-Supervised Prompt Optimization

A self-supervised framework optimizes prompts for both closed and open-ended tasks by evaluating LLM outputs without external references, reducing costs and required data.

· Published on Feb 7, 2025
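
The reference-free idea is to score candidate prompts by comparing the model's own outputs against each other rather than against labels. A toy loop under that reading, where generate and judge_pairwise are stubs rather than the paper's implementation:

```python
import random

def generate(prompt: str, task_input: str) -> str:
    """Stub LLM call that would produce an output for a prompt/input pair."""
    return f"output of {prompt!r} on {task_input!r}"

def judge_pairwise(task_input: str, out_a: str, out_b: str) -> int:
    """Stub LLM judge: 0 if out_a is preferred, 1 otherwise (no ground truth used)."""
    return random.randint(0, 1)

def optimize(candidates: list[str], inputs: list[str], rounds: int = 3) -> str:
    best = candidates[0]
    for _ in range(rounds):
        for challenger in candidates[1:]:
            wins = sum(
                judge_pairwise(x, generate(challenger, x), generate(best, x)) == 0
                for x in inputs
            )
            if wins > len(inputs) / 2:   # challenger preferred on a majority of inputs
                best = challenger
    return best

random.seed(0)
prompts = ["Summarize:", "Summarize in one sentence:", "TL;DR:"]
print(optimize(prompts, ["doc A", "doc B", "doc C"]))
```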

Submitted by

akhaliq

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0, a memory-centric architecture with graph-based memory, enhances long-term conversational coherence in LLMs by efficiently extracting, consolidating, and retrieving information, outperforming existing memory systems in terms of accuracy and computational efficiency.

· Published on Apr 28, 2025
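
A graph-based memory of this kind can be pictured as (subject, relation, object) facts that are extracted, de-duplicated on write, and retrieved by the entities a query mentions. A toy sketch; Mem0 itself uses an LLM for extraction and richer consolidation and retrieval than this:

```python
class GraphMemorySketch:
    """Toy graph-style memory: facts are edges, writes de-duplicate, reads return
    the facts touching entities named in the query."""

    def __init__(self):
        self.edges: set[tuple[str, str, str]] = set()

    def add(self, facts: list[tuple[str, str, str]]) -> None:
        self.edges.update(facts)            # set union doubles as consolidation here

    def retrieve(self, query: str) -> list[tuple[str, str, str]]:
        mentioned = {w.strip("?.,").lower() for w in query.split()}
        return [e for e in self.edges
                if e[0].lower() in mentioned or e[2].lower() in mentioned]

mem = GraphMemorySketch()
mem.add([("alice", "works_at", "acme"), ("alice", "prefers", "python")])
mem.add([("alice", "works_at", "acme")])    # duplicate fact is consolidated away
print(mem.retrieve("Where does Alice work?"))
```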

Submitted by

Cxxs

OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions

Existing feedforward subject-driven video customization methods mainly study single-subject scenarios due to the difficulty of constructing multi-subject training data pairs. Another challenging problem, how to use signals such as depth, mask, camera, and text prompts to control and edit the subject in the customized video, is still less explored. In this paper, we first propose a data construction pipeline, VideoCus-Factory, to produce training data pairs for multi-subject customization from raw videos without labels and control signals such as depth-to-video and mask-to-video pairs. Based on our constructed data, we develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. Then we propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). LE enables inference with more subjects by using the training subjects to activate more frame embeddings. TAE encourages the generation process to extract guidance from temporally aligned control signals by assigning the same frame embeddings to the control and noise tokens. Experiments demonstrate that our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations. Video demos are at our project page: https://caiyuanhao1998.github.io/project/OmniVCus/. Our code will be released at https://github.com/caiyuanhao1998/Open-OmniVCus.

  • 14 authors

· Published on Jun 29, 2025
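
The Temporally Aligned Embedding idea, as summarized above, is that control tokens and noise tokens belonging to the same frame share the same frame embedding. A toy illustration with made-up shapes, not the OmniVCus architecture:

```python
import numpy as np

# Control tokens and noise tokens for frame f both receive frame embedding f,
# so guidance can be lined up with the frame being denoised. Shapes and the
# embedding table here are illustrative only.
rng = np.random.default_rng(0)
frames, tokens_per_frame, dim = 4, 3, 8
frame_emb = rng.normal(size=(frames, dim))             # one embedding per frame

noise_tokens = rng.normal(size=(frames, tokens_per_frame, dim))
control_tokens = rng.normal(size=(frames, tokens_per_frame, dim))

noise_tokens += frame_emb[:, None, :]                  # frame f -> embedding f
control_tokens += frame_emb[:, None, :]                # same embedding, same frame

print(noise_tokens.shape, control_tokens.shape)        # (4, 3, 8) twice
```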

Submitted by

taesiri

DeepCode: Open Agentic Coding

DeepCode, a fully autonomous framework, addresses the challenges of document-to-codebase synthesis by optimizing information flow through source compression, structured indexing, knowledge injection, and error correction, achieving state-of-the-art performance and surpassing human experts.

  • 5 authors

· Published on Dec 8, 2025

Submitted by

akhaliq

LightRAG: Simple and Fast Retrieval-Augmented Generation

LightRAG improves Retrieval-Augmented Generation by integrating graph structures for enhanced contextual awareness and efficient information retrieval, achieving better accuracy and response times.

  • 5 authors

· Published on Oct 8, 2024
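
The graph-structure idea can be illustrated by linking chunks to the entities they mention and expanding a query one hop along entity co-occurrence before collecting chunks. The sketch below shows only that idea; it is not LightRAG's API:

```python
from collections import defaultdict

# Toy corpus: chunks plus the entities each one mentions (hand-labeled here;
# a real system would extract entities and relations automatically).
chunks = {
    "c1": "LightRAG indexes entities and relations extracted from documents.",
    "c2": "Graph neighborhoods give retrieval more contextual awareness.",
    "c3": "Response time improves when the index is compact.",
}
entity_of_chunk = {"c1": {"lightrag", "entities"}, "c2": {"graph", "entities"}, "c3": {"index"}}

# Build entity -> chunks and entity -> co-occurring entity maps.
chunks_by_entity, neighbors = defaultdict(set), defaultdict(set)
for cid, ents in entity_of_chunk.items():
    for e in ents:
        chunks_by_entity[e].add(cid)
        neighbors[e].update(ents - {e})

def retrieve(query_entities: set[str]) -> list[str]:
    hops = set(query_entities)
    for e in query_entities:
        hops.update(neighbors[e])                # expand one hop along the graph
    hit_ids = sorted({cid for e in hops for cid in chunks_by_entity[e]})
    return [chunks[cid] for cid in hit_ids]

print(retrieve({"lightrag"}))                    # pulls c1 and, via "entities", c2
```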

Submitted by

taesiri

SAM Audio: Segment Anything in Audio

SAM Audio, a diffusion transformer-based foundation model, achieves superior performance in general audio separation using unified text, visual, and temporal span prompts across various audio types.

Submitted by

Rbin

RAG-Anything: All-in-One RAG Framework

RAG-Anything is a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching, outperforming existing methods on complex benchmarks.

Submitted by

akhaliq

Very Large-Scale Multi-Agent Simulation in AgentScope

Enhancements to the AgentScope platform improve scalability, efficiency, and ease of use for large-scale multi-agent simulations through distributed mechanisms, flexible environments, and user-friendly tools.

· Published on Jul 25, 2024
