TTD-RAG: A Test-Time Diffusion Framework for the MMU-RAG Competition
This repository contains our submission to the MMU-RAG Competition: a deep research agent named TTD-RAG. The system is a faithful implementation of the framework proposed in the paper "Deep Researcher with Test-Time Diffusion (TTD-DR)". This README was generated with Gemini 2.5.
TTD-RAG conceptualizes report generation as an iterative "denoising" process: it starts with a preliminary draft and progressively refines it through cycles of targeted search, synthesis, and revision. This approach is designed to excel at complex, multi-hop reasoning tasks that require coherent, long-form answers.
🎯 Key Features
- Test-Time Diffusion Framework: Models research report generation as an iterative process of refining a "noisy" draft with external information, ensuring coherence and reducing information loss.
- Report-Level Denoising with Retrieval: Uses an evolving draft to dynamically guide the search process, ensuring each retrieval step is targeted at filling specific knowledge gaps.
- Component-wise Self-Evolution: Enhances the quality of each step in the workflow (planning, synthesis) by generating diverse variants, critiquing them, and merging them into a superior output.
- High-Performance Serving: Utilizes vLLM to serve both the generative (`Qwen/Qwen3-4B-Instruct-2507`) and reranking (`tomaarsen/Qwen3-Reranker-0.6B-seq-cls`) models for high throughput and low latency.
- Competition Compliant: Fully supports both the dynamic (streaming) and static evaluation endpoints required by the competition rules, validated with the provided `local_test.py` script.
⚙️ System Architecture & Workflow
The agent operates in a structured, multi-stage process orchestrated by `src/pipeline.py` (a simplified sketch of the loop follows the stage list):
- Stage 1: Planning & Initial Drafting
  - An initial Research Plan is generated to outline the key areas of investigation.
  - A preliminary Noisy Draft is created from the LLM's internal knowledge, serving as the starting point for the diffusion process.
- Stage 2: Iterative Search & Denoising
  - The system enters a loop in which, for each iteration:
    - A new search query is generated, informed by the current draft's deficiencies and the overall plan.
    - Documents are retrieved from the FineWeb Search API.
    - The retrieved documents are chunked and reranked with a specialized model to surface the most relevant information.
    - The top-ranked chunks are synthesized into a concise answer to the search query.
    - The draft is revised ("denoised") by integrating this new information.
- Stage 3: Final Report Generation
  - After the iterations complete, the agent synthesizes the final refined draft, the initial plan, and the full history of questions and answers into a single, comprehensive report.
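For orientation only, here is a heavily simplified Python sketch of that loop. It is not the actual `src/pipeline.py`; every helper below (`generate_plan`, `search_fineweb`, `rerank_chunks`, and so on) is a hypothetical stand-in for the prompt, retrieval, and reranking calls the real pipeline makes.

```python
"""Illustrative sketch of the TTD-RAG denoising loop (not the real src/pipeline.py)."""

from dataclasses import dataclass, field


@dataclass
class ResearchState:
    question: str
    plan: str = ""
    draft: str = ""
    qa_history: list[tuple[str, str]] = field(default_factory=list)


# Placeholder helpers: the real pipeline backs each of these with an LLM call,
# the FineWeb Search API, or the reranker served by vLLM.
def generate_plan(question: str) -> str:
    return f"Plan outlining key areas for: {question}"

def generate_noisy_draft(question: str, plan: str) -> str:
    return f"Preliminary draft (internal knowledge only) for: {question}"

def generate_search_query(state: ResearchState) -> str:
    return f"Query targeting a gap in the current draft about: {state.question}"

def search_fineweb(query: str) -> list[str]:
    return [f"Document retrieved for '{query}'"]

def rerank_chunks(query: str, docs: list[str], top_k: int = 5) -> list[str]:
    return docs[:top_k]  # real pipeline chunks the docs and scores them with the reranker

def synthesize_answer(query: str, chunks: list[str]) -> str:
    return f"Concise answer to '{query}' grounded in {len(chunks)} chunk(s)"

def revise_draft(draft: str, query: str, answer: str) -> str:
    return draft + f"\n[denoised with the answer to: {query}]"

def compose_final_report(state: ResearchState) -> str:
    return f"{state.plan}\n\n{state.draft}\n\nQ&A history: {len(state.qa_history)} steps"


def run_ttd_rag(question: str, num_iterations: int = 3) -> str:
    # Stage 1: planning and initial "noisy" draft.
    state = ResearchState(question=question)
    state.plan = generate_plan(question)
    state.draft = generate_noisy_draft(question, state.plan)

    # Stage 2: iterative search and denoising.
    for _ in range(num_iterations):
        query = generate_search_query(state)
        docs = search_fineweb(query)
        chunks = rerank_chunks(query, docs)
        answer = synthesize_answer(query, chunks)
        state.draft = revise_draft(state.draft, query, answer)
        state.qa_history.append((query, answer))

    # Stage 3: final report from the refined draft, plan, and Q&A history.
    return compose_final_report(state)


if __name__ == "__main__":
    print(run_ttd_rag("How do test-time diffusion agents refine research reports?"))
```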
🛠️ Technology Stack
- Backend Framework: FastAPI
- LLM Serving: vLLM
- Generative LLM: `Qwen/Qwen3-4B-Instruct-2507`
- Reranker Model: `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`
- Retrieval Source: FineWeb Search API
- Containerization: Docker
🚀 Getting Started
Prerequisites
- Docker and Docker Compose
- An NVIDIA GPU with 24GB+ VRAM
- NVIDIA Container Toolkit
1. Configure Environment
First, create a local environment file from the example template (typically `cp .env.example .env`; use whatever template name ships with this repo). This file will store your API keys.
Now, open `.env` and add your API keys for:
- `FINEWEB_API_KEY`
- `OPENROUTER_API_KEY` (used as a fallback for the generator)
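A filled-in `.env` will look roughly like this (placeholder values; the authoritative list of variables is the repo's example template):

```
FINEWEB_API_KEY=your-fineweb-api-key
OPENROUTER_API_KEY=your-openrouter-api-key
```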
2. Build and Run the Container
We recommend using Docker Compose, which handles building the image and running the services as defined in `compose.yml`.
```bash
docker compose up --build
```
This command will:
- Build the Docker image from the `Dockerfile`.
- Start the container.
- Execute the `start.sh` script, which first launches the vLLM OpenAI-compatible server in the background to serve the Qwen models.
- After a brief pause to allow the models to load, start the FastAPI application on port `5053`.
Your API is now running and accessible at http://localhost:5053.
✅ Testing Your Implementation
You can verify that your service is compliant with the competition requirements using the provided `local_test.py` script.
```bash
uv sync
source venv/bin/activate

# Test both the /run and /evaluate endpoints (full test)
python local_test.py --base-url http://localhost:5053

# Test only the dynamic /run endpoint
python local_test.py --base-url http://localhost:5053 --test-mode run

# Test only the static /evaluate endpoint
python local_test.py --base-url http://localhost:5053 --test-mode evaluate
```
A successful run will confirm that both endpoints are functioning correctly and that the `result.jsonl` file is generated as expected for the static evaluation.
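For a quick sanity check of the generated file, a small script like the following can help. It assumes each line of `result.jsonl` is a JSON object carrying the same `query_id` and `generated_response` fields returned by the static endpoint; adjust the path and field names to whatever `local_test.py` actually writes.

```python
import json
from pathlib import Path

# Hypothetical sanity check: verify every line parses as JSON and carries the expected fields.
records = [json.loads(line) for line in Path("result.jsonl").read_text().splitlines() if line.strip()]
missing = [r for r in records if "query_id" not in r or "generated_response" not in r]
print(f"{len(records)} records parsed, {len(missing)} missing expected fields")
```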
📋 API Endpoints
- Health Check: `GET /health`
  - A simple endpoint to confirm the service is running. Returns `{"status": "ok"}`.
- Dynamic Evaluation: `POST /run`
  - Input: `{"question": "string"}`
  - Output: A Server-Sent Events (SSE) stream that provides real-time updates on the agent's progress, including intermediate steps, citations, and the final report.
- Static Evaluation: `POST /evaluate`
  - Input: `{"query": "string", "iid": "string"}`
  - Output: A single JSON response: `{"query_id": "string", "generated_response": "string"}`.
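The snippet below is a minimal, illustrative Python client for these endpoints (using the `requests` library). It assumes the service is running locally on port 5053; it is not part of the competition tooling.

```python
import requests

BASE_URL = "http://localhost:5053"

# Health check
print(requests.get(f"{BASE_URL}/health", timeout=10).json())  # {"status": "ok"}

# Dynamic evaluation: stream Server-Sent Events from /run
with requests.post(
    f"{BASE_URL}/run",
    json={"question": "What is test-time diffusion?"},
    stream=True,
    timeout=600,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines(decode_unicode=True):
        if line:  # SSE frames arrive as non-empty lines, e.g. "data: {...}"
            print(line)

# Static evaluation: single JSON response from /evaluate
resp = requests.post(
    f"{BASE_URL}/evaluate",
    json={"query": "What is test-time diffusion?", "iid": "example-1"},
    timeout=600,
)
print(resp.json())  # {"query_id": "...", "generated_response": "..."}
```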
🚢 Competition Submission
The following commands push your final Docker image to the competition's ECR repository.
- Sign in to AWS ECR

  ```bash
  aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <your-aws-account-id>.dkr.ecr.us-east-1.amazonaws.com
  ```

- Build the Image (if not already built). Ensure you build for the correct platform.

  ```bash
  docker build --platform linux/amd64 -t ttt-dr:latest .
  ```

- Tag the Image for ECR

  ```bash
  docker tag ttt-dr:latest <your-aws-account-id>.dkr.ecr.us-east-1.amazonaws.com/neurips2025text/ttt-dr:latest
  ```

- Push the Image to ECR

  ```bash
  docker push <your-aws-account-id>.dkr.ecr.us-east-1.amazonaws.com/neurips2025text/ttt-dr:latest
  ```