vivekvkashyap/autoresearch-rl


If you've ever tried RL post-training, you know the pain. You tweak the learning rate, kick off a run, wait 20 minutes, check the eval, see it collapsed, roll back, try something else. Repeat until you lose your mind or your GPU reservation expires. The whole thing is fragile: a slightly wrong clipping ratio or one bad KL penalty and your model forgets how to write coherent sentences. It's not like pretraining, where you can kind of set it and forget it. RL is just... unstable by nature.

So when I saw Andrej Karpathy's autoresearch, where he let an AI agent autonomously run pretraining experiments overnight, the first thing that came to mind was: why not do this for RL post-training? That's where the real pain is. That's where you actually need the help.

That's what this is. You point an AI agent at a real RL training setup, go to sleep, and wake up to a log of 60+ experiments it ran while you were gone. Each one modifies the config, trains for 10 minutes, checks whether evals improved, keeps or discards the change, and moves on to the next idea. No babysitting. No manual hyperparameter sweeps. Just let it cook.
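
To make that loop concrete, here's a rough Python sketch of the keep-or-revert cycle. It isn't the repo's agent code: propose_change and run_experiment are placeholders for "the agent edits train.toml" and "run.py trains for ~10 minutes and reports eval_score".

# Illustrative sketch of the overnight loop, not the actual agent code.
import shutil
from typing import Callable

def overnight_loop(
    propose_change: Callable[[str], None],  # placeholder: agent edits the config file
    run_experiment: Callable[[], float],    # placeholder: ~10-min run, returns eval_score
    n_experiments: int = 60,
    config_path: str = "train.toml",
) -> float:
    best = run_experiment()  # baseline score with the current config
    for i in range(n_experiments):
        shutil.copy(config_path, config_path + ".bak")  # snapshot the current best config
        propose_change(config_path)                     # try one idea (lr, batch size, ...)
        score = run_experiment()
        if score > best:
            best = score                                    # improvement: keep the edit
        else:
            shutil.copy(config_path + ".bak", config_path)  # regression: revert
        print(f"experiment {i}: eval_score={score:.3f} best={best:.3f}")
    return best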

Built on prime-rl, honestly my favourite RL post-training framework out there, and verifiers for reward verification.

Shoutout to @willccbb for creating verifiers and making it open source.

Progress

[Figure: Autoresearch-RL progress]

How it works

The repo has four files that matter:

  • prepare.py - fixed constants, one-time setup (downloads base model, verifies GPUs). Not modified.
  • train.toml - the single file the agent edits and iterates on. Contains the full RL training configuration: optimizer, learning rate, loss function, environments, rollout settings, etc. (A small sketch of reading it follows this list.)
  • run.py - experiment runner. Launches prime-rl, enforces the time budget, extracts metrics. Not modified.
  • program.md - instructions for the agent. This file is edited and iterated on by the human.
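
For orientation, here's a small sketch of inspecting train.toml with the standard library. The key names mirror the best-config table further down; where they live inside the real file's section layout is an assumption here.

# Peek at the agent-editable config. tomllib ships with Python 3.11+,
# so it's available under the repo's 3.12+ requirement. The keys below
# mirror the best-config table; their actual sections in train.toml may differ.
import tomllib

with open("train.toml", "rb") as f:
    cfg = tomllib.load(f)

for key in ("lr", "batch_size", "rollouts_per_example", "max_steps"):
    for section, values in cfg.items():
        if isinstance(values, dict) and key in values:
            print(f"[{section}] {key} = {values[key]}")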

By design, training runs for a fixed 10-minute time budget. The metric is eval_score (average pass@1 across environments), higher is better.
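In code, those two rules look roughly like this. It's a sketch of the contract, not run.py's implementation, and the training command is left as a placeholder:

# Sketch only: a hard wall-clock budget on training, and
# eval_score = mean pass@1 across the eval environments.
import subprocess

TIME_BUDGET_S = 10 * 60

def train_with_budget(cmd: list[str]) -> None:
    """Run a training command (placeholder), killing it at the time budget."""
    try:
        subprocess.run(cmd, timeout=TIME_BUDGET_S, check=False)
    except subprocess.TimeoutExpired:
        pass  # hitting the budget is the normal case, not an error

def eval_score(pass_at_1_by_env: dict[str, float]) -> float:
    """Average pass@1 across environments, e.g. {"gsm8k": 0.55} -> 0.55."""
    return sum(pass_at_1_by_env.values()) / len(pass_at_1_by_env)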

Quick start

Requirements: 2 NVIDIA GPUs, Python 3.12+, uv.

# 1. Install uv (if you don't already have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies (includes prime-rl, verifiers, vllm, torch)
uv sync

# 3. Install flash-attn (required by prime-rl trainer, needs CUDA toolkit)
uv sync --extra flash-attn

# 4. One-time setup (download model, verify GPUs)
uv run prepare.py

# 5. Run a single experiment (~10 min)
uv run run.py

Running the agent

Spin up Claude/Codex in this repo, then prompt:

Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.

Design choices

  • Single file to modify. The agent only touches train.toml. Diffs are easy to review.
  • Fixed time budget. Training always runs for 10 minutes, making experiments directly comparable.
  • 2-GPU setup. GPU 0 runs vLLM inference, GPU 1 runs the RL trainer. No memory contention (see the placement sketch after this list).
  • Multiple environments. GSM8K by default. Composite metric prevents overfitting to one task.
  • Built on prime-rl. Production-ready async RL framework with GRPO/IPO/DPPO support.
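
Here's a minimal sketch of that two-process placement using CUDA_VISIBLE_DEVICES. The entry-point commands are placeholders (the real ones come from prime-rl via run.py), and run.py may handle device pinning differently.

# Sketch of pinning each process to its own GPU. Commands are placeholders.
import os
import subprocess

def launch_on_gpu(cmd: list[str], gpu: int) -> subprocess.Popen:
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
    return subprocess.Popen(cmd, env=env)

inference = launch_on_gpu(["python", "serve_inference.py"], gpu=0)  # GPU 0: vLLM inference
trainer = launch_on_gpu(["python", "train_rl.py"], gpu=1)           # GPU 1: RL trainer
trainer.wait()         # training finishes (or hits the time budget)
inference.terminate()  # then shut the inference server down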

Best configuration found

After 66 experiments, the agent converged on this config (train.toml), going from a baseline of 0.475 to 0.550 eval_score on GSM8K (pass@1):

Parameter              Value     Why it matters
lr                     3e-6      Lower than the initial 5e-6; prevents overfitting over 20 steps
batch_size             256       Smaller batches = more gradient updates in the same time budget
rollouts_per_example   4         Fewer rollouts per example, but more steps overall
max_steps              20        Sweet spot: 14 was too few, 24 overfits
token_mask_high        5.0       Clips extreme importance ratios at the token level
length_weighted_mean   true      Normalizes loss by response length; helps with variable-length outputs
scheduler              constant  No warmup, no decay; just constant LR for 20 steps
optimizer              adamw     Default, but confirmed better than SGD and Muon
temperature            1.0       Default sampling temperature; 0.7 and 1.2 both hurt

Key things that didn't work: LoRA (rank 16), cosine/linear schedules, higher LR (5e-6+), rollouts=8 (no gain over 4), difficulty filtering, repetition penalty, and torch.compile.

License

MIT