A Claude Code setup that implements ML papers from arXiv. Give it a paper, and it orchestrates a team of persistent AI agents to read the paper, plan the implementation, write the code, verify correctness, optimize performance, train, and compare results against the paper's claims.
## Usage
```bash
# Install tmux first if needed:
#   macOS:         brew install tmux
#   Ubuntu/Debian: sudo apt install tmux

tmux new-session
claude --dangerously-skip-permissions --setting-sources project "implement 2307.08691"
```
Or paste an arXiv URL or describe a paper. Claude handles the rest. tmux is required for agent teams to work.
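For instance, either of these prompts should kick off the same workflow (the flags match the command above; the paper references are only illustrative):

```bash
claude --dangerously-skip-permissions --setting-sources project \
  "implement https://arxiv.org/abs/2307.08691"

claude --dangerously-skip-permissions --setting-sources project \
  "implement the FlashAttention-2 paper"
```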
## Requirements
- Claude Code with agent teams enabled
- tmux
- uv (Python package manager)
- GPU recommended for training phases
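A rough setup sketch, assuming macOS (Homebrew) or Ubuntu/Debian and an npm install of Claude Code; check each tool's docs for current instructions, and enable agent teams in Claude Code's settings per its documentation:

```bash
# tmux
brew install tmux            # macOS
sudo apt install tmux        # Ubuntu/Debian

# uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Claude Code (agent teams must be enabled in its settings; see its docs)
npm install -g @anthropic-ai/claude-code
```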
## How it works
Claude acts as team lead, creating one persistent team per paper. Teammates stay alive across phases, accumulating context and communicating directly with each other.
### Agent team
| Role | Phases | Responsibilities |
|---|---|---|
| researcher | 1-8 | Paper expert, writes exploration notes, critiques plan and docs |
| researcher-critic | 1 only | Critiques researcher's notes (temporary) |
| model-engineer | 2-8 | Drafts architecture plan, implements model code |
| data-engineer | 3-8 | Data loading, preprocessing, batching |
| training-engineer | 3-8 | Training loop, logging, checkpointing, runs full training |
| tester | 4-8 | Correctness verification against paper |
| performance-engineer | 5-8 | Profiling and optimization |
| eval-engineer | 7-8 | Evaluation pipeline, results comparison |
| documentation-engineer | 8 | Writes project README |
### Phases
1. Paper exploration — researcher reads the paper thoroughly, writes structured notes; researcher-critic reviews via critique loop
2. Planning — model-engineer drafts PLAN.md; researcher critiques
3. Implementation — model/data/training engineers implement in parallel with dependency tracking
4. Verification — tester checks code against paper equations and algorithms, files bugs directly with engineers
5. Performance optimization — iterative profile-fix-measure cycles
6. Training — full training run with pre-authorized emergency actions and anomaly monitoring
7. Results verification — eval-engineer compares outputs against paper claims
8. Documentation — documentation-engineer writes README; researcher reviews
Key documents (notes, plan, README) go through a critique loop: author drafts, reviewer critiques, max 2 revision rounds. Each phase produces notes, decisions, and git commits.
See CLAUDE.md for the full protocol.
## Project structure
```
workspace/<paper-name>/
├── paper/              # downloaded paper (TeX preferred, then HTML, then PDF)
├── notes/
│   ├── decisions.md    # key decisions log
│   ├── note-NN-*.md    # diary entries
│   └── task-*.md       # work orders for teammates
├── src/                # implementation code
├── PLAN.md             # implementation plan
├── README.md           # project documentation
└── pyproject.toml
```
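To check on a run's progress, the notes directory and git history are the quickest view. A minimal sketch, assuming the workspace directory is its own git repository, with `<paper-name>` standing in for whatever directory the team created:

```bash
cd workspace/<paper-name>
cat notes/decisions.md      # key decisions logged so far
ls notes/                   # diary entries and task files
git log --oneline           # commits accumulated across phases
```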