(auto) autoresearch

(mad) AI agents running research on single-GPU nanochat training, automatically

This sits on top of Karpathy's autoresearch, which is itself a hyperparameter search. Basically, it's a search for a search.

My Idea

Karpathy's autoresearch is brilliant. It works, but there is a catch: the "Blank Page Problem" of AI. LLMs are fundamentally reactive engines. They are designed to follow, not to lead. If the leader (the user) is asleep, the follower (the agent) defaults to the "average" of all its knowledge, which is mediocre and boring, and gravitates toward safe, incremental moves.

But AlphaGo's move 37 was neither safe nor average.

To fix this, we need another piece of software that acts as a "Chaos Monkey" or a "Creative Director." Think of it as the mutation rate in a genetic algorithm; without it, you can get stuck in a local minimum forever.

This repo is my attempt to build that monkey. This is a search for a search.

Architecture

Each experiment method lives in its own directory with train.py, prepare.py, and program.md. The baseline/ directory is Karpathy's vanilla autoresearch loop. Other directories (e.g. mad-scientist/) add a director — a Go binary that generates creative research directives before each experiment iteration. The director works in three steps:

  1. Summarize the current train.py via DeepSeek Chat (so it always knows the actual code state)
  2. Fetch a random ML paper abstract from arxiv (external novelty injection)
  3. Generate a directive via DeepSeek Reasoner, combining code summary + experiment history + paper into one specific idea

The output is a tentative suggestion ("I think you could try...") so the upstream agent reasons about it rather than blindly executing.

I chose Go because I wanted the running agent to have absolutely zero context about the director and to see it as a black box. Just a binary that spits out ideas.
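In sketch form, the pipeline looks roughly like this (Python for illustration; the real director is a Go binary, the endpoint and model names follow DeepSeek's OpenAI-compatible API, and the prompts and helper names are simplified stand-ins):

```python
import os
import random
import requests

DEEPSEEK_URL = "https://api.deepseek.com/chat/completions"
HEADERS = {"Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}"}

def chat(model, prompt, temperature=1.0):
    resp = requests.post(DEEPSEEK_URL, headers=HEADERS, json={
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def summarize_code():
    # Step 1: summarize the current train.py so the directive reflects
    # the actual code state, not a stale assumption.
    code = open("train.py").read()
    return chat("deepseek-chat", "Summarize this training script:\n" + code)

def random_abstract(category="cs.LG"):
    # Step 2: pull a random-ish paper abstract from the public arXiv API
    # as an external source of novelty.
    feed = requests.get("http://export.arxiv.org/api/query", params={
        "search_query": f"cat:{category}",
        "start": random.randint(0, 2000),
        "max_results": 1,
    }).text
    return feed.split("<summary>")[1].split("</summary>")[0].strip()

def directive(history):
    # Step 3: fuse code summary + experiment history + paper into one
    # specific, tentatively phrased suggestion via the reasoning model.
    prompt = (
        f"Code summary:\n{summarize_code()}\n\n"
        f"Experiment history:\n{history}\n\n"
        f"Random paper abstract:\n{random_abstract()}\n\n"
        "Propose ONE bold experiment, phrased as 'I think you could try...'."
    )
    return chat("deepseek-reasoner", prompt, temperature=1.2)
```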

Structure

baseline/                 # vanilla autoresearch (control group)
  train.py, prepare.py, program.md
  .claude/                # agent confinement (hooks + settings)

mad-scientist/            # experiment with director-driven exploration
  train.py, prepare.py, program.md, director, .env
  .claude/                # agent confinement (hooks + settings)

mad-scientist-3-11/       # second director run (full paper summaries, lower temp)
  train.py, prepare.py, program.md, director, .env
  .claude/                # agent confinement (hooks + settings)

director/                 # director source code
  main.go
  configs/*.json          # per-experiment config (prompts, arxiv terms, model, temp)
  logs/api_calls.jsonl    # centralized API call log across all experiments
  .env                    # DEEPSEEK_API_KEY

results/                  # tracking (gitignored)
  <method>/<run-id>/results.tsv

analysis.ipynb            # cross-method comparison (envelope, terminal perf, stall)
Makefile

Current Experiments

| Name | Director | Description |
| --- | --- | --- |
| baseline | None | Vanilla autoresearch. The agent decides what to try next on its own. Control group. |
| mad-scientist | DeepSeek Reasoner (temp 1.2) | Summarizes current code, reads experiment history, fetches a random ML paper abstract from arxiv, then generates a bold directive framed as a suggestion. Combines code awareness + historical context + external novelty. |
| mad-scientist-3-11 | DeepSeek Reasoner (temp 1.0) | Iteration on mad-scientist: feeds full paper summaries (not just abstracts) via DeepSeek-summarized ar5iv fetches, narrower arxiv categories (cs.LG + cs.CL only), stricter system prompt with scale-awareness rules and history-dedup guidance. |

Director config differences: mad-scientist vs mad-scientist-3-11

| | mad-scientist | mad-scientist-3-11 |
| --- | --- | --- |
| Temperature | 1.2 | 1.0 |
| Paper input | Abstract only (`{{paper_abstract}}`) | Full summary (`{{paper_abstract}}` + `{{paper_summary}}`) |
| arxiv categories | 5 (cs.LG, cs.CL, cs.AI, cs.CV, stat.ML) | 2 (cs.LG, cs.CL) |
| System prompt | Generic "small GPT", loose rules | Specifies ~10M params, stricter rules (read history, one thing at a time, scale-aware) |
| Max response | Not specified | 3 paragraphs |

Mad-Scientist Runs vs Original Baseline (Karpathy's H100)

(figure: relative improvement trajectory)

| | original-baseline | mad-scientist | mad-scientist-3-11 |
| --- | --- | --- | --- |
| Experiments | 126 | 96 | 96 |
| Keeps (rate) | 23 (18.3%) | 22 (22.9%) | 16 (16.7%) |
| Total improvement | 2.83% | 3.94% | 2.79% |
| Improvement/experiment | 0.0224% | 0.0411% | 0.0291% |
| Longest stall | 25 | 15 | 31 |
| Best BPB | 0.969686 | 1.286357 | 1.302036 |

mad-scientist remains the strongest run — highest keep rate, most total improvement, and shortest stalls. mad-scientist-3-11 underperformed despite having richer paper context: lower keep rate (16.7% vs 22.9%), less total improvement (2.79% vs 3.94%), and a 31-iteration stall at the end where it couldn't break past 1.302. The tighter prompt and lower temperature may have over-constrained the director, producing more conservative suggestions that failed to escape local minima. (Note that Best BPB is not directly comparable across columns: the original baseline ran on an H100, while the mad-scientist runs ran on the Jetson.)

Key moments from mad-scientist (run 1)

The biggest single jump came at iteration 44 — the agent removed the logit softcap (tanh clamping at ±15), dropping val_bpb from 1.309 to 1.299 in one step. Notably, the director had suggested something entirely different (linear attention to break an 8-iteration stall), but the researcher agent ignored it and made a simpler, more effective change on its own. The director's value here was indirect: by flagging the stall and pushing for radical action, it prompted the agent to re-examine the code and spot the softcap as dead weight — something it had overlooked for 43 iterations.

The second jump came at iteration 67 — this time the agent followed the director's advice almost exactly. After another 8-iteration stall (best stuck at 1.294), the director suggested reducing attention heads from 6 to 4 while increasing HEAD_DIM from 64 to 96 to preserve the embedding dimension. No paper was fetched; this was purely the director's own reasoning. The agent implemented it verbatim (val_bpb 1.294 → 1.290), then kept pushing the same idea through iterations 68–69 (HEAD_DIM=128 → 192), reaching 1.287 before single-head attention went too far. Both jumps were triggered by the same stall-detection mechanism ("8 consecutive failures → go radical"), but with opposite dynamics: iteration 44 succeeded by ignoring the director, iteration 67 by following it.
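As a quick sanity check on that trade (the embedding width of 384 follows from the original 6 heads × 64 dims):

```python
# n_embd = n_heads * head_dim stays constant as the idea is pushed
# further (iterations 67-69, then the single-head step that went too far):
for n_heads, head_dim in [(6, 64), (4, 96), (3, 128), (2, 192), (1, 384)]:
    assert n_heads * head_dim == 384
```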

Key moments from mad-scientist-3-11 (run 2)

The biggest breakthrough came at iteration 43 — replacing ReLU² with SwiGLU activation (iso-parameter), jumping from 1.316 to 1.306. This was followed by a productive stretch (iterations 47–65) of incremental optimizer tuning that pushed BPB down to 1.302. After iteration 65, the run hit a wall: 31 consecutive experiments without improvement, exploring attention tweaks, initialization changes, and architectural modifications — none broke through.
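For reference, one common way to write an iso-parameter SwiGLU block in PyTorch (a sketch under the usual 2/3-width convention, not the run's actual diff):

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, dim, relu2_hidden):
        super().__init__()
        # SwiGLU uses 3 matrices instead of the ReLU^2 MLP's 2, so shrink
        # the hidden width by 2/3 to keep the parameter count equal:
        # 3 * dim * h == 2 * dim * relu2_hidden  =>  h = 2/3 * relu2_hidden
        h = (2 * relu2_hidden) // 3
        self.w_gate = nn.Linear(dim, h, bias=False)
        self.w_up = nn.Linear(dim, h, bias=False)
        self.w_down = nn.Linear(h, dim, bias=False)

    def forward(self, x):
        # silu(gate) * up, then project back down to the model width
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```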

Commands

# List available director configs
make list

# Build + deploy director for an experiment (default: linux/arm64 for NVIDIA Jetson)
make deploy EXPERIMENT=mad-scientist

# Build for local macOS instead
make deploy EXPERIMENT=mad-scientist GOOS=darwin GOARCH=arm64

# Run the director (from inside an experiment directory)
./director --verbose

# Run training (from inside an experiment directory)
python3 train.py

Adding a new experiment method

  1. Copy baseline/ to a new directory
  2. Create director/configs/<name>.json with a custom system prompt, user prompt, arxiv terms, model, and temperature (example after this list)
  3. make deploy EXPERIMENT=<name>
  4. Edit program.md in the new directory to integrate the director into the loop
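For step 2, a config might look something like this (the field names here are illustrative guesses; only the {{paper_abstract}}/{{paper_summary}} placeholders and the knobs listed above are documented, so check director/main.go for the real schema):

```json
{
  "model": "deepseek-reasoner",
  "temperature": 1.0,
  "arxiv_categories": ["cs.LG", "cs.CL"],
  "system_prompt": "You are a research director for a ~10M-param GPT trained under a 5-minute budget...",
  "user_prompt": "Code summary:\n{{code_summary}}\n\nHistory:\n{{history}}\n\nPaper:\n{{paper_abstract}}\n\n{{paper_summary}}"
}
```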

Hardware

I'm GPU poor. I only have an NVIDIA Jetson AGX Orin to play with. Let's port autoresearch to that.

Jetson AGX Orin port

This fork has been adapted to run on an NVIDIA Jetson AGX Orin 32GB (JetPack 6, CUDA 12.6, PyTorch 2.10). The key changes from upstream:

  • Replaced Flash Attention 3 with PyTorch's built-in scaled_dot_product_attention, sketched after this list (the FA3/kernels package targets Hopper/desktop Ampere and doesn't build on Jetson's SM 8.7).
  • Removed torch.compile — Triton is not available on aarch64, so the inductor backend fails. Model and optimizer run in eager mode.
  • Removed kernels dependency and the pinned torch CUDA index from pyproject.toml (Jetson uses its own JetPack-provided PyTorch).
  • Tuned hyperparameters for the Orin's ~5–24 TFLOPS BF16 throughput (size-dependent) and 30 GB unified memory.
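The attention swap is the mechanical part of the port; in sketch form (illustrative, assuming the usual (batch, heads, seq, head_dim) layout rather than the fork's exact module):

```python
import torch.nn.functional as F

def attention(q, k, v):
    # q, k, v: (batch, n_heads, seq_len, head_dim)
    # SDPA dispatches to the best fused kernel available at runtime and
    # runs on Jetson's SM 8.7, unlike the FA3 kernels package.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)
```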

Setup on Jetson

# Install deps (torch is already provided by JetPack)
pip3 install rustbpe tiktoken pyarrow requests numpy pandas matplotlib

# Clone and prep
git clone <repo-url> && cd autoresearch
python3 prepare.py    # download data + train tokenizer
python3 train.py      # baseline run (~5 min)

Hyperparameter sweep results

All runs use the fixed 5-minute time budget on a Jetson AGX Orin 32GB with MAX_SEQ_LEN=512, HEAD_DIM=64, WINDOW_PATTERN="L":

| DEPTH | DEVICE_BATCH_SIZE | TOTAL_BATCH_SIZE | val_bpb | Steps | Params | VRAM |
| --- | --- | --- | --- | --- | --- | --- |
| 4 | 16 | 2^15 | 1.488 | 872 | 11.5M | 1.2 GB |
| 8 | 32 | 2^17 | 1.519 | 95 | 50.3M | 6.2 GB |
| 6 | 32 | 2^17 | 1.419 | 147 | 26.3M | 4.4 GB |
| 6 | 32 | 2^16 | 1.357 | 279 | 26.3M | 4.4 GB |
| 6 | 32 | 2^15 | 1.341 | 535 | 26.3M | 4.4 GB |
| 6 | 32 | 2^14 | 1.338 | 1018 | 26.3M | 4.3 GB |

The best configuration is DEPTH=6, DEVICE_BATCH_SIZE=32, TOTAL_BATCH_SIZE=2^14 (the current defaults in this fork). Key takeaways:

  • DEPTH=8 is too large — the 50M param model only gets 95 steps, not enough to converge in 5 minutes.
  • DEPTH=6 (26.3M params) is the sweet spot — enough capacity while still allowing hundreds of steps.
  • Smaller batch sizes win — on this hardware, taking more optimizer steps matters more than using bigger batches. The improvement plateaus around 2^14 (grad_accum=1; see the quick check below).
  • Only 4.3 GB VRAM used out of 30 GB available, leaving plenty of room for the autonomous agent to experiment with larger architectures.
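The (grad_accum=1) note follows directly from the sweep's constants; a quick check:

```python
MAX_SEQ_LEN, DEVICE_BATCH_SIZE = 512, 32
tokens_per_fwd_bwd = MAX_SEQ_LEN * DEVICE_BATCH_SIZE   # 16384 tokens per pass
for total_bs in [2**17, 2**16, 2**15, 2**14]:
    grad_accum = total_bs // tokens_per_fwd_bwd        # 8, 4, 2, 1
    print(total_bs, grad_accum)                        # at 2**14 every pass is an optimizer step
```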

autoresearch


Once upon a time, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of "group meeting". That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base; in any case, no one could tell if that's right or wrong as the "code" is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began. -@karpathy, March 2026.

The idea: give an AI agent a small but real LLM training setup and let it experiment autonomously overnight. It modifies the code, trains for 5 minutes, checks if the result improved, keeps or discards, and repeats. You wake up in the morning to a log of experiments and (hopefully) a better model. The training code here is a simplified single-GPU implementation of nanochat. The core idea is that you're not touching any of the Python files like you normally would as a researcher. Instead, you are programming the program.md Markdown files that provide context to the AI agents and set up your autonomous research org. The default program.md in this repo is intentionally kept as a bare bones baseline, though it's obvious how one would iterate on it over time to find the "research org code" that achieves the fastest research progress, how you'd add more agents to the mix, etc. A bit more context on this project is here in this tweet.

How it works

The repo is deliberately kept small and only really has three files that matter:

  • prepare.py — fixed constants, one-time data prep (downloads training data, trains a BPE tokenizer), and runtime utilities (dataloader, evaluation). Not modified.
  • train.py — the single file the agent edits. Contains the full GPT model, optimizer (Muon + AdamW), and training loop. Everything is fair game: architecture, hyperparameters, optimizer, batch size, etc. This file is edited and iterated on by the agent.
  • program.md — baseline instructions for one agent. Point your agent here and let it go. This file is edited and iterated on by the human.

By design, training runs for a fixed 5-minute time budget (wall clock, excluding startup/compilation), regardless of the details of your compute. The metric is val_bpb (validation bits per byte) — lower is better, and vocab-size-independent so architectural changes are fairly compared.
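In spirit, the loop the agent runs is something like this (pseudocode: propose_edit, parse_val_bpb, and revert are hypothetical placeholders for the agent's own tool use; the real orchestration lives in program.md):

```python
import subprocess

best_bpb = float("inf")
for i in range(100):                        # roughly one night of experiments
    propose_edit("train.py")                # hypothetical: agent makes one change
    out = subprocess.run(["uv", "run", "train.py"],
                         capture_output=True, text=True)
    val_bpb = parse_val_bpb(out.stdout)     # hypothetical: read final val_bpb
    if val_bpb < best_bpb:
        best_bpb = val_bpb                  # improvement: keep the change
    else:
        revert("train.py")                  # no improvement: back to the best version
```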

Quick start

Requirements: A single NVIDIA GPU (tested on H100), Python 3.10+, uv.

# 1. Install uv project manager (if you don't already have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# 2. Install dependencies
uv sync

# 3. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py

# 4. Manually run a single training experiment (~5 min)
uv run train.py

If the above commands all run OK, your setup is good and you can go into autonomous research mode.

Running the agent

Simply spin up your Claude/Codex or whatever you want in this repo (and disable all permissions), then you can prompt something like:

Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.

Each experiment directory includes a .claude/ folder with a PreToolUse hook (cage.sh) that prevents the agent from reading or writing files outside its own directory. This ensures experiments stay isolated and agents can't accidentally clobber each other's files.

The program.md file is essentially a super lightweight "skill".

Project structure

prepare.py      — constants, data prep + runtime utilities (do not modify)
train.py        — model, optimizer, training loop (agent modifies this)
program.md      — agent instructions
pyproject.toml  — dependencies

Design choices

  • Single file to modify. The agent only touches train.py. This keeps the scope manageable and diffs reviewable.
  • Fixed time budget. Training always runs for exactly 5 minutes, regardless of your specific platform, which works out to roughly 12 experiments/hour and roughly 100 experiments while you sleep. This design decision has two upsides. First, it makes experiments directly comparable regardless of what the agent changes (model size, batch size, architecture, etc.). Second, it means autoresearch will find the best model for your platform within that time budget. The downside is that your runs (and results) are not comparable to those of people running on other compute platforms.
  • Self-contained. No external dependencies beyond PyTorch and a few small packages. No distributed training, no complex configs. One GPU, one file, one metric.

Platform support

This code currently requires a single NVIDIA GPU. In principle it is quite possible to support CPU, MPS, and other platforms, but this would also bloat the code, and I'm not 100% sure I want to take that on personally right now. People can reference (or have their agents reference) the full/parent nanochat repository, which has wider platform support and shows the various solutions (e.g. a Flash Attention 3 kernels fallback implementation, generic device support, autodetection, etc.). Feel free to create forks or discussions for other platforms; I'm happy to link to them here in the README in a new notable forks section.

Seeing as there seems to be a lot of interest in tinkering with autoresearch on much smaller compute platforms than an H100, a few extra words. If you're going to try running autoresearch on smaller computers (MacBooks etc.), I'd recommend one of the forks below. On top of that, here are some recommendations for aspiring forks on how to tune the defaults for much smaller models:

  1. To get half-decent results I'd use a dataset with a lot less entropy, e.g. this TinyStories dataset. These are GPT-4 generated short stories. Because the data is a lot narrower in scope, you will see reasonable results with a lot smaller models (if you try to sample from them after training).
  2. You might experiment with decreasing vocab_size, e.g. from 8192 down to 4096, 2048, 1024, or even a simple byte-level tokenizer with 256 possible bytes after UTF-8 encoding.
  3. In prepare.py, you'll want to lower MAX_SEQ_LEN a lot, depending on the computer even down to 256 etc. As you lower MAX_SEQ_LEN, you may want to experiment with increasing DEVICE_BATCH_SIZE in train.py slightly to compensate. The number of tokens per fwd/bwd pass is the product of these two.
  4. Also in prepare.py, you'll want to decrease EVAL_TOKENS so that your validation loss is evaluated on a lot less data.
  5. In train.py, the primary single knob that controls model complexity is DEPTH (default 8 here). A lot of variables are just functions of it, so e.g. lower it down to 4.
  6. You'll most likely want to use a WINDOW_PATTERN of just "L", because "SSSL" uses an alternating banded attention pattern that may be very inefficient for you. Try it.
  7. You'll want to lower TOTAL_BATCH_SIZE a lot, but keep it a power of 2, e.g. down to 2**14 (~16K) or so; it's hard to tell exactly.

I think these are the reasonable hyperparameters to play with. Ask your favorite coding agent for help, and copy-paste it this guide along with the full source code.
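Put together, a small-platform configuration following this guide might look like (illustrative values only; the real constants live in prepare.py and train.py):

```python
# prepare.py side
VOCAB_SIZE  = 2048      # item 2: down from 8192 (or 256 for byte-level)
MAX_SEQ_LEN = 256       # item 3: much shorter sequences
EVAL_TOKENS = 2**18     # item 4: evaluate on far less data (value is a guess)

# train.py side
DEPTH             = 4      # item 5: the main complexity knob (default 8)
DEVICE_BATCH_SIZE = 32     # item 3: bump slightly as MAX_SEQ_LEN shrinks
WINDOW_PATTERN    = "L"    # item 6: avoid the banded "SSSL" pattern
TOTAL_BATCH_SIZE  = 2**14  # item 7: ~16K tokens, keep it a power of 2
```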

Notable forks

License

MIT