NanoGPT Slowrun - Q


Compute grows much faster than data. Our current scaling laws require proportional increases in both to scale. But the asymmetry in their growth means intelligence will eventually be bottlenecked by data, not compute. This is easy to see if you look at almost anything other than language models. In robotics and biology, the massive data requirement leads to weak models, and both fields have enough economic incentive to deploy 1000x more compute if that led to significantly better results. But they can't, because nobody knows how to scale with compute alone without adding more data. The solution is to build new learning algorithms that work in limited-data, practically-infinite-compute settings. This is what we are solving at Q Labs: our goal is to understand and solve generalization.

Figure: NanoGPT Slowrun baseline on 100M tokens (2.4x data efficiency)

Last week we released NanoGPT Slowrun, an open repo for data-efficient learning algorithms. The rules are simple: train on 100M tokens from FineWeb, use as much compute as you want, lowest validation loss wins. Improvements are submitted as PRs to the repo and merged if they lower val loss. The constraint is the inverse of speedruns like modded-nanogpt, which optimize wall-clock time. Those benchmarks have been hugely productive, but optimizing for speed filters out expensive ideas: heavy regularization, second-order optimizers, gradient descent alternatives. Slowrun is built for exactly those ideas.

What we've found so far

Muon outperforms every optimizer we tested (AdamW, SOAP, MAGMA). Multi-epoch training matters. And following work by Kotha et al., scaling to large parameter counts works if you pair it with aggressive regularization: weight decay up to 16x standard, plus dropout. The baseline sits at ~2.4x data efficiency against modded-nanogpt.
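To make the "16x standard weight decay" concrete, here is a minimal sketch of a decoupled (AdamW-style) weight-decay update, where the decay term shrinks weights toward zero independently of the gradient step. The learning rate and the assumed baseline decay of 0.1 are illustrative placeholders, not the repo's actual hyperparameters:

```python
def decoupled_weight_decay_step(w, grad, lr=0.02, wd=1.6):
    """One decoupled weight-decay update, AdamW-style.

    wd=1.6 illustrates "16x an assumed standard of 0.1".
    The decay is applied directly to the weights, separate
    from the gradient term, so it acts as pure regularization.
    """
    return [wi - lr * g - lr * wd * wi for wi, g in zip(w, grad)]


# With zero gradient, only the decay term moves the weight:
w = decoupled_weight_decay_step([1.0], [0.0])  # 1.0 - 0.02 * 1.6 * 1.0
```

Decoupling matters here: with aggressive decay, folding it into the gradient (classic L2) would interact badly with adaptive optimizers, whereas the decoupled form shrinks all weights at a fixed rate.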

Update: 5.5x Data Efficiency

Since the initial release, community contributions have pushed data efficiency from ~2.4x to 5.5x against modded-nanogpt, more than doubling in a few days. The key changes are: shuffling at the start of each epoch, which had outsized impact on multi-epoch training; learned projections for value embeddings instead of separate embedding tables; swapping squared ReLU for SwiGLU activation; and ensembling multiple models. 10x data efficiency seems reachable in the short term. 100x might be feasible by the end of the year, given how many directions remain unexplored, but it will require serious exploration on the algorithms side.
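The activation swap can be illustrated with scalar versions of the two functions. This is a hedged sketch: the repo applies these elementwise inside the MLP, and the linear projections that produce the gate and up inputs are omitted here:

```python
import math


def squared_relu(x):
    # The previous activation: max(x, 0)^2
    return max(x, 0.0) ** 2


def silu(x):
    # SiLU / swish: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def swiglu(x_gate, x_up):
    # SwiGLU gating: the "gate" branch passes through SiLU
    # and multiplicatively modulates the "up" branch.
    return silu(x_gate) * x_up
```

The key structural difference is that SwiGLU is gated: it takes two projected inputs and lets one branch scale the other, rather than applying a fixed pointwise nonlinearity to a single input.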

Figure: Updated Slowrun results at 5.5x data efficiency

Directions we think are wide open

  • Second-order optimizers and natural gradient methods
  • Diffusion models
  • Curriculum learning
  • Gradient descent alternatives like evolutionary search
  • Optimizing for compression/model-complexity

If you're working on any of this or something we haven't thought of, open an issue on the repo, or email research@qlabs.sh.
