Sokoban Speedrun
Fastest recipes for RL-training models to solve Sokoban to a held-out target on one node:
- LLM Track: RL-fine-tune Qwen3-4B-Instruct-2507 from 57% to >80% held-out pass@1 on one 8xH100.
- Non-LLM Track: train a from-scratch agent on a single H100.
Play Sokoban if the task is unfamiliar.
LLM Track
World record history
| # | Record time (h:mm:ss) | FLOPs | Description | Date | Log | held-out pass@1 | Contributors |
|---|---|---|---|---|---|---|---|
| 1 | 1:27:31 | 1.250837e+18 | GRPO, LR 1.6e-6 annealed, 75 steps | 2026-06-17 | llm/records/2026-06-17_01 | 0.891 (CI [0.86, 0.92]) | @JeanKaddour |
Rules
Fastest wall-clock run wins: one run on one 8xH100 node, from training step 1 through the final checkpoint, which must clear the target.
- Target: lower 95% bootstrap CI > 0.80 on llm/datasets/sokoban_eval.jsonl.
- Eval: 8 completions/puzzle, 12,288 tokens, temperature 0.8, top-p 0.95, seed 12345.
- Fixed: model, train set, eval set, reward function, hardware.
- Open: RL algorithm, loss, schedules, engine, parallelism, domain-agnostic rewards, prompt.
- Not allowed: Sokoban-specific hints, heuristics, or few-shot examples.
- Verification: maintainers rerun at a second seed; both runs must clear the target.
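The target rule above (lower 95% bootstrap CI > 0.80) can be illustrated with a minimal sketch. This is not the repository's eval code: the function names are hypothetical, and it assumes a percentile bootstrap that resamples whole puzzles, which the rules do not specify. Each puzzle is represented as a list of 0/1 outcomes, one per completion.

```python
import random


def pass_at_1(per_puzzle):
    """Mean over puzzles of the fraction of successful completions."""
    return sum(sum(c) / len(c) for c in per_puzzle) / len(per_puzzle)


def bootstrap_ci(per_puzzle, n_resamples=2000, alpha=0.05, seed=12345):
    """Percentile bootstrap CI for pass@1 (assumption: resample puzzles)."""
    rng = random.Random(seed)
    rates = [sum(c) / len(c) for c in per_puzzle]
    n = len(rates)
    stats = []
    for _ in range(n_resamples):
        # Draw n puzzles with replacement and average their solve rates.
        sample = [rates[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


def clears_target(per_puzzle, target=0.80):
    """True when the lower CI bound is strictly above the target."""
    lo, _ = bootstrap_ci(per_puzzle)
    return lo > target
```

With 8 completions per puzzle, `per_puzzle` would hold one 8-element list per eval puzzle. The run clears the target only if the lower bound, not the point estimate, exceeds 0.80.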
Running
cd llm
uv sync
NODE_GPUS=8 uv run torchrun --standalone --nproc_per_node=3 -m speedrun
uv run python -m eval_speedrun --eval-checkpoint outputs/<run>/step_000075
# Modal (modal_app.py rents an 8xH100)
uv run modal volume put nanochat-rl-hf datasets/sokoban_train.jsonl /datasets/sokoban_train.jsonl
uv run modal volume put nanochat-rl-hf datasets/sokoban_eval.jsonl /datasets/sokoban_eval.jsonl
uv run modal run --detach modal_app.py
EVAL_CHECKPOINT=latest uv run modal run modal_app.py
Non-LLM Track
World record history
| # | Record time (mm:ss) | Description | Date | Log | held-out pass@1 | Contributors |
|---|---|---|---|---|---|---|
| 1 | 22:24 | cnn-mingru h256 | 2026-06-21 | non_llm/records/2026-06-21_01 | 0.744 (CI [0.72, 0.77]) | @JeanKaddour |
Rules
Fastest wall-clock run wins: one run on one 1×H100 node, from training step 1 through the first checkpoint whose held-out CI clears the target.
- Target: lower 95% CI on held-out Boxoban solve-rate > 0.70.
- Eval: official DeepMind Boxoban held-out splits (per-level greedy scoring); default unfiltered/test.
- Disjointness: training draws only from the official unfiltered/train split; eval uses the disjoint unfiltered/test. The eval bin's sha256 and every scored level are pinned in the record so verify_record.py confirms the eval pool offline.
- Open: policy architecture, RL algorithm, optimizer, schedules, implementation.
- Verification: verify_record.py re-derives pass@1/CI; maintainers also rerun at a second seed. Both runs must clear the target.
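The offline pinning idea above (comparing the eval bin against a sha256 stored in the record) can be sketched as follows. This is an illustration only, not the contents of verify_record.py; the function names and the way the pinned digest is passed in are assumptions.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large eval bins need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def eval_pool_matches(path, pinned_sha256):
    """True when the local file's digest equals the digest pinned in a record."""
    return sha256_of_file(path) == pinned_sha256.strip().lower()
```

A mismatch means the local eval pool differs from the one the record was scored on, so its pass@1 would not be comparable.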
Running
cd non_llm
uv sync
uv run python speedrun_non_llm.py
uv run modal run --detach modal_app_non_llm.py
Submitting a record
- Train, then eval the final checkpoint — logs, source snapshots, and the eval JSON are written automatically.
- Generate the report for the relevant track and fill in the Idea section:
LLM track:
cd llm
uv run python ../make_record_report.py records/<your-dir>
Non-LLM track:
cd non_llm
uv run python ../make_record_report.py records/<your-dir>
- Open a PR adding the record dir + a row in the matching track's world record history. CI runs the track's verifier.
Credits
@joshua-a-harris's nanoRL speedrun, nanochat, modded-nanoGPT, ScaleRL, ReasoningGym for the LLM-track Sokoban env, DeepMind for Boxoban, and PufferLib for the efficient Boxoban implementation.