Absolute Zero: Reinforced Self-Play Reasoning with Zero Data

arxiv.org

86 points by leodriesch 3 days ago


a2128 - 3 days ago

To be clear, this is not a model trained on zero data; it's a pretrained model (Qwen 2.5, trained on 18 trillion tokens) fine-tuned on self-generated data grounded by a Python interpreter.
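
Concretely, "grounded by a Python interpreter" means the interpreter acts as the verifier: the model proposes a task (a program plus an expected output), attempts to solve it, and the reward is simply whether execution reproduces the expected output. A minimal sketch of that loop, assuming a binary pass/fail reward; names like run_python and reward are illustrative, not taken from the paper's code:

    # Sketch of interpreter-grounded self-play rewards.
    # All names are illustrative, not from the AZR codebase.
    import subprocess

    def run_python(source: str, timeout: float = 5.0) -> str | None:
        """Run a candidate program in a subprocess; return stdout, or None on failure."""
        try:
            result = subprocess.run(
                ["python", "-c", source],
                capture_output=True, text=True, timeout=timeout,
            )
            return result.stdout if result.returncode == 0 else None
        except subprocess.TimeoutExpired:
            return None

    def reward(candidate_source: str, expected_output: str) -> float:
        """Binary reward: 1.0 iff the interpreter reproduces the expected output."""
        out = run_python(candidate_source)
        return 1.0 if out is not None and out.strip() == expected_output.strip() else 0.0

    # Self-play loop (schematic): the same model both proposes tasks and solves
    # them; only the interpreter decides correctness, so no human labels enter.
    # task = model.propose_task()              # e.g. (program, input, expected_output)
    # solution = model.solve(task.prompt)      # model's attempted program
    # r = reward(solution, task.expected_output)
    # update_policy(model, task, solution, r)  # RL step, e.g. PPO/GRPO

The point of the subprocess boundary is that the reward can't be gamed by plausible-looking text: the code either runs and produces the right output, or it doesn't.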

macrolime - 3 days ago

Pretty sure OpenAI and/or DeepMind have been doing something very similar for a while, just without publishing it.

gitroom - 3 days ago

Sometimes I feel like the whole self-play thing is kinda the obvious path now, but it's still nuts seeing it actually work better than huge data dumps. You ever wonder how much of progress is just crazy good pipelines versus actual breakthroughs?

Waterluvian - 3 days ago

Related to this: has anyone seen a model respond with "oh wait, I was wrong…" when you follow up with a "can you explain why this answer is right?"

I find that my uses of GPT and others still struggle with a sort of tunnel vision.

squillion - 3 days ago

Warning: abuse of this technique may cause the model to go blind.

nullc - 2 days ago

It'd be nice to see some of these run on languages the pretrained model is a little less good at than Python and JS.

QuadmasterXLII - 3 days ago

For everyone who says “modern incentives forbid publishing negative results,” let this stand as a counterexample!

mentalgear - 3 days ago

"Despite using zero human-curated data, AZR achieves state-of-the-art results on diverse coding and math reasoning benchmarks, even outperforming models trained on large in-domain datasets. This demonstrates the potential for sophisticated reasoning skills to emerge purely through self-play without domain-specific supervision."