TTD-RAG: A Test-Time Diffusion Framework for the MMU-RAG Competition
This repository contains our submission to the MMU-RAG Competition: a deep research agent named TTD-RAG. The system is a faithful implementation of the framework proposed in the paper "Deep Researcher with Test-Time Diffusion (TTD-DR)". This README was generated with Gemini 2.5.
TTD-RAG conceptualizes report generation as an iterative "denoising" process: it starts with a preliminary draft and progressively refines it through cycles of targeted search, synthesis, and revision. This approach is designed to excel at complex, multi-hop reasoning tasks that require coherent, long-form answers.
🎯 Key Features
- Test-Time Diffusion Framework: Models research report generation as an iterative process of refining a "noisy" draft with external information, ensuring coherence and reducing information loss.
- Report-Level Denoising with Retrieval: Uses an evolving draft to dynamically guide the search process, ensuring each retrieval step is targeted at filling specific knowledge gaps.
- Component-wise Self-Evolution: Enhances the quality of each step in the workflow (planning, synthesis) by generating diverse variants, critiquing them, and merging them into a superior output.
- High-Performance Serving: Utilizes vLLM to serve both the generative (`Qwen/Qwen3-4B-Instruct-2507`) and reranking (`tomaarsen/Qwen3-Reranker-0.6B-seq-cls`) models for high throughput and low latency.
- Competition Compliant: Fully supports both the dynamic (streaming) and static evaluation endpoints required by the competition rules, validated with the provided `local_test.py` script.
⚙️ System Architecture & Workflow
The agent operates in a structured, multi-stage process orchestrated by `src/pipeline.py` (a simplified sketch of the loop follows the stage list):
- Stage 1: Planning & Initial Drafting
  - An initial Research Plan is generated to outline the key areas of investigation.
  - A preliminary Noisy Draft is created from the LLM's internal knowledge, serving as the starting point for the diffusion process.
- Stage 2: Iterative Search & Denoising
  - The system enters a loop in which, for each iteration:
    - A new search query is generated, informed by the current draft's deficiencies and the overall plan.
    - Documents are retrieved from the FineWeb Search API.
    - The retrieved documents are chunked and reranked with a specialized model to surface the most relevant information.
    - The top-ranked chunks are synthesized into a concise answer to the search query.
    - The draft is revised ("denoised") by integrating this new information.
- Stage 3: Final Report Generation
  - After the iterations complete, the agent synthesizes the final refined draft, the initial plan, and the full history of questions and answers into a single, comprehensive report.
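For orientation only, here is a heavily simplified Python sketch of that loop. It is not the actual `src/pipeline.py`; every helper below (`generate_plan`, `search_fineweb`, `rerank_chunks`, and so on) is a hypothetical stand-in for the prompt, retrieval, and reranking calls the real pipeline makes.

```python
"""Illustrative sketch of the TTD-RAG denoising loop (not the real src/pipeline.py)."""

from dataclasses import dataclass, field


@dataclass
class ResearchState:
    question: str
    plan: str = ""
    draft: str = ""
    qa_history: list[tuple[str, str]] = field(default_factory=list)


# Placeholder helpers: the real pipeline backs each of these with an LLM call,
# the FineWeb Search API, or the reranker served by vLLM.
def generate_plan(question: str) -> str:
    return f"Plan outlining key areas for: {question}"

def generate_noisy_draft(question: str, plan: str) -> str:
    return f"Preliminary draft (internal knowledge only) for: {question}"

def generate_search_query(state: ResearchState) -> str:
    return f"Query targeting a gap in the current draft about: {state.question}"

def search_fineweb(query: str) -> list[str]:
    return [f"Document retrieved for '{query}'"]

def rerank_chunks(query: str, docs: list[str], top_k: int = 5) -> list[str]:
    return docs[:top_k]  # real pipeline chunks the docs and scores them with the reranker

def synthesize_answer(query: str, chunks: list[str]) -> str:
    return f"Concise answer to '{query}' grounded in {len(chunks)} chunk(s)"

def revise_draft(draft: str, query: str, answer: str) -> str:
    return draft + f"\n[denoised with the answer to: {query}]"

def compose_final_report(state: ResearchState) -> str:
    return f"{state.plan}\n\n{state.draft}\n\nQ&A history: {len(state.qa_history)} steps"


def run_ttd_rag(question: str, num_iterations: int = 3) -> str:
    # Stage 1: planning and initial "noisy" draft.
    state = ResearchState(question=question)
    state.plan = generate_plan(question)
    state.draft = generate_noisy_draft(question, state.plan)

    # Stage 2: iterative search and denoising.
    for _ in range(num_iterations):
        query = generate_search_query(state)
        docs = search_fineweb(query)
        chunks = rerank_chunks(query, docs)
        answer = synthesize_answer(query, chunks)
        state.draft = revise_draft(state.draft, query, answer)
        state.qa_history.append((query, answer))

    # Stage 3: final report from the refined draft, plan, and Q&A history.
    return compose_final_report(state)


if __name__ == "__main__":
    print(run_ttd_rag("How do test-time diffusion agents refine research reports?"))
```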
🛠️ Technology Stack
- Backend Framework: FastAPI
- LLM Serving: vLLM
- Generative LLM: `Qwen/Qwen3-4B-Instruct-2507`
- Reranker Model: `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`
- Retrieval Source: FineWeb Search API
- Containerization: Docker
🚀 Getting Started
Prerequisites
- Docker and Docker Compose
- An NVIDIA GPU with 24GB+ VRAM
- NVIDIA Container Toolkit
1. Configure Environment
First, create a local environment file from the example template (typically `cp .env.example .env`; use whatever template name ships with this repo). This file will store your API keys.
Now, open `.env` and add your API keys for:
- `FINEWEB_API_KEY`
- `OPENROUTER_API_KEY` (used as a fallback for the generator)
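A filled-in `.env` will look roughly like this (placeholder values; the authoritative list of variables is the repo's example template):

```
FINEWEB_API_KEY=your-fineweb-api-key
OPENROUTER_API_KEY=your-openrouter-api-key
```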
2. Build and Run the Container
We recommend using Docker Compose, which handles building the image and running the services as defined in `compose.yml`.
```bash
docker compose up --build
```
This command will:
- Build the Docker image from the `Dockerfile`.
- Start the container.
- Execute the `start.sh` script, which first launches the vLLM OpenAI-compatible server in the background to serve the Qwen models.
- After a brief pause to allow the models to load, start the FastAPI application on port `5053`.
Your API is now running and accessible at http://localhost:5053.
✅ Testing Your Implementation
You can verify that your service is compliant with the competition requirements using the provided `local_test.py` script.
```bash
uv sync
source venv/bin/activate

# Test both the /run and /evaluate endpoints (full test)
python local_test.py --base-url http://localhost:5053

# Test only the dynamic /run endpoint
python local_test.py --base-url http://localhost:5053 --test-mode run

# Test only the static /evaluate endpoint
python local_test.py --base-url http://localhost:5053 --test-mode evaluate
```
A successful run will confirm that both endpoints are functioning correctly and that the `result.jsonl` file is generated as expected for the static evaluation.
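For a quick sanity check of the generated file, a small script like the following can help. It assumes each line of `result.jsonl` is a JSON object carrying the same `query_id` and `generated_response` fields returned by the static endpoint; adjust the path and field names to whatever `local_test.py` actually writes.

```python
import json
from pathlib import Path

# Hypothetical sanity check: verify every line parses as JSON and carries the expected fields.
records = [json.loads(line) for line in Path("result.jsonl").read_text().splitlines() if line.strip()]
missing = [r for r in records if "query_id" not in r or "generated_response" not in r]
print(f"{len(records)} records parsed, {len(missing)} missing expected fields")
```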
📋 API Endpoints
- Health Check: `GET /health`
  - A simple endpoint to confirm the service is running. Returns `{"status": "ok"}`.
- Dynamic Evaluation: `POST /run`
  - Input: `{"question": "string"}`
  - Output: A Server-Sent Events (SSE) stream that provides real-time updates on the agent's progress, including intermediate steps, citations, and the final report.
- Static Evaluation: `POST /evaluate`
  - Input: `{"query": "string", "iid": "string"}`
  - Output: A single JSON response: `{"query_id": "string", "generated_response": "string"}`.
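The snippet below is a minimal, illustrative Python client for these endpoints (using the `requests` library). It assumes the service is running locally on port 5053; it is not part of the competition tooling.

```python
import requests

BASE_URL = "http://localhost:5053"

# Health check
print(requests.get(f"{BASE_URL}/health", timeout=10).json())  # {"status": "ok"}

# Dynamic evaluation: stream Server-Sent Events from /run
with requests.post(
    f"{BASE_URL}/run",
    json={"question": "What is test-time diffusion?"},
    stream=True,
    timeout=600,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line:  # SSE frames arrive as non-empty lines, e.g. "data: {...}"
            print(line)

# Static evaluation: single JSON response from /evaluate
resp = requests.post(
    f"{BASE_URL}/evaluate",
    json={"query": "What is test-time diffusion?", "iid": "example-1"},
    timeout=600,
)
print(resp.json())  # {"query_id": "...", "generated_response": "..."}
```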
🚢 Competition Submission
The following commands push your final Docker image to the competition's ECR repository.
- Sign in to AWS ECR

  ```bash
  aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <your-aws-account-id>.dkr.ecr.us-east-1.amazonaws.com
  ```

- Build the Image (if not already built). Ensure you build for the correct platform.

  ```bash
  docker build --platform linux/amd64 -t ttt-dr:latest .
  ```

- Tag the Image for ECR

  ```bash
  docker tag ttt-dr:latest <your-aws-account-id>.dkr.ecr.us-east-1.amazonaws.com/neurips2025text/ttt-dr:latest
  ```

- Push the Image to ECR

  ```bash
  docker push <your-aws-account-id>.dkr.ecr.us-east-1.amazonaws.com/neurips2025text/ttt-dr:latest
  ```