🚨 New Meta Superintelligence Labs Paper 🚨

What do we do when we don’t have reference answers for RL? What if annotations are too expensive or unknown?

Compute as Teacher (CaT 🐈) turns inference compute into a post-training supervision signal. CaT improves performance by up to 30%, even on non-verifiable domains (HealthBench), across 3 model families.

In CaT, the model self-synthesises existing GRPO rollouts into a high-quality answer, reconciling disagreements, partial solutions, and facts. We use this estimated reference to reward each individual rollout, converting more inference compute into supervision for RL!

As training progresses, the quality of the rollouts, and of the estimated reference, keeps increasing, with the model learning from experience 🔄

Without any thinking tokens, across Llama, Gemma, and Qwen, we find large improvements on both MATH and HealthBench!

How does CaT work for non-verifiable domains? A frozen copy of the model generates a rubric from the estimated reference, allowing automated rewards with an LLM that checks each criterion. We show this outperforms both SFT and direct LLM judgements.

As an inference-time technique, the synthesis in CaT outperforms majority voting and self-selected best-of-N. This is because CaT doesn't just select, it reconciles. It can even disagree with the answers of individual rollouts, combining them to produce a better one!

Beyond synthesis, generating supervision opens the door to learning from any inference-time method we develop in the future! It's bitter-lesson friendly 📈

Paper and thread below 👇
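
For readers who want the shape of the loop in code: a rough Python sketch of the rubric-based reward described above (synthesise a reference from the rollouts, derive a rubric from it, score each rollout per criterion). Every name here (`cat_rewards`, `group_advantages`, the `toy_*` helpers) is hypothetical and not taken from the paper. Scoring a rollout as the fraction of criteria it satisfies is an assumption, since the thread does not say how criteria are aggregated. The toy helpers replace the model calls with simple string operations so the sketch runs on its own.

```python
from statistics import mean, pstdev
from typing import Callable, List

# All names below are hypothetical stand-ins, not the paper's API.
Synthesize = Callable[[str, List[str]], str]   # (prompt, rollouts) -> estimated reference
MakeRubric = Callable[[str], List[str]]        # estimated reference -> criteria
Judge = Callable[[str, str], bool]             # (rollout, criterion) -> satisfied?


def cat_rewards(prompt: str, rollouts: List[str], synthesize: Synthesize,
                make_rubric: MakeRubric, judge: Judge) -> List[float]:
    """Score each rollout against a rubric derived from a synthesised reference."""
    reference = synthesize(prompt, rollouts)   # reconcile the rollouts into one answer
    rubric = make_rubric(reference)            # a frozen model would write these criteria
    if not rubric:
        return [0.0] * len(rollouts)
    # Assumption: reward = fraction of criteria the judge marks as satisfied.
    return [sum(judge(r, c) for c in rubric) / len(rubric) for r in rollouts]


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style group-relative advantages: centre and scale within the group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy stand-ins so the sketch runs without any model.
def _facts(text: str) -> set:
    """Split a comma-separated answer into a set of stripped facts."""
    return {f.strip() for f in text.split(",") if f.strip()}


def toy_synthesize(prompt: str, rollouts: List[str]) -> str:
    """Merge the facts of all rollouts into one answer (ignores the prompt)."""
    return ", ".join(sorted(set().union(*map(_facts, rollouts))))


def toy_make_rubric(reference: str) -> List[str]:
    """Treat every fact in the reference as one rubric criterion."""
    return sorted(_facts(reference))


def toy_judge(rollout: str, criterion: str) -> bool:
    """A criterion is satisfied if the rollout states that exact fact."""
    return criterion in _facts(rollout)
```

With rollouts such as `["rest, fluids", "fluids, see a doctor", "rest"]`, the toy synthesis merges all three facts into one reference that no single rollout contains, which is the "reconcile, don't select" idea in miniature. In a real setup the three callables would be LLM calls, and the advantages would feed a GRPO update.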
