Show HN: Solving ARC AGI 2 with interleaved thinking and stateful IPython REPL

github.com

2 points by steinsgate 3 days ago · 0 comments · 2 min read

My friends and I started this project in the summer of 2025 with the initial goal of participating in the ARC Prize Kaggle competition. Early on, we were exploring agentic coding with frontier reasoning models and found that models like o3 and o4-mini could generate high-quality synthetic ARC-style puzzles. Our plan was to use these synthetic puzzles to train a smaller model via agentic reinforcement learning (RLVR with interleaved thinking).

To bootstrap this process, we needed successful solution traces from an open-weight reasoning model for cold-start supervised fine-tuning. That requirement led us to investigate GPT-OSS-120B. While doing so, we noticed something unexpected: simply placing the model into the interleaved thinking regime produced large and consistent score improvements on ARC AGI 2 tasks. We were seeing scores that we didn't think were possible for a medium-sized open-weight model.
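Roughly, the regime looks like the sketch below (hypothetical Python; the endpoint, model name, prompt, and function names are my assumptions for illustration, not the repo's actual harness). The model alternates free-form reasoning with ```python``` blocks; each block is executed in a single persistent IPython shell, so variables and partial hypotheses survive across turns, and the execution output is fed back as the next user message.

```python
# Hypothetical sketch of an interleaved-thinking loop over a stateful IPython REPL.
# SYSTEM_PROMPT, MODEL, and solve_task are illustrative names, not the project's code.
import re

from IPython.core.interactiveshell import InteractiveShell
from IPython.utils.capture import capture_output
from openai import OpenAI  # assumes an OpenAI-compatible endpoint serving the model

client = OpenAI()          # e.g. pointed at a local vLLM server hosting GPT-OSS-120B
MODEL = "gpt-oss-120b"     # assumed model identifier

SYSTEM_PROMPT = (
    "You are solving an ARC puzzle. Think step by step. To test a hypothesis, "
    "emit a ```python ...``` block; its output will be returned to you. "
    "Variables persist between blocks. Reply DONE plus the answer grid when finished."
)

def extract_code(text: str) -> str | None:
    """Pull the last ```python ...``` block out of the model's reply, if any."""
    blocks = re.findall(r"```python\n(.*?)```", text, flags=re.DOTALL)
    return blocks[-1] if blocks else None

def solve_task(task_json: str, max_turns: int = 12) -> str:
    shell = InteractiveShell.instance()  # one shell => state survives across turns
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Task (train/test pairs as JSON):\n{task_json}"},
    ]
    for _ in range(max_turns):
        reply = client.chat.completions.create(model=MODEL, messages=messages)
        text = reply.choices[0].message.content
        messages.append({"role": "assistant", "content": text})
        if "DONE" in text:
            return text  # final answer, parsed downstream
        code = extract_code(text)
        if code is None:
            messages.append({"role": "user", "content": "No code block found; continue."})
            continue
        with capture_output() as cap:
            result = shell.run_cell(code)  # executes in the persistent namespace
        if result.success:
            feedback = cap.stdout or repr(result.result)  # printed output, else last value
        else:
            feedback = repr(result.error_in_exec)
        messages.append({"role": "user", "content": f"Execution output:\n{feedback}"})
    return "FAILED: turn budget exhausted"
```

The key design point is that the REPL is stateful: the model can load the task grids once, build helper functions, and refine a candidate transformation over several think-execute turns instead of regenerating everything from scratch each time.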

This observation ultimately shifted the focus of our work: we wanted to find out how universally the effect applies while staying within our resource constraints. We concluded that it applies quite generally, with double-digit gains on frontier models as well.

I have previously read debates about whether ARC AGI 2 is primarily a reasoning benchmark or a visual benchmark. I guess we can now add "agentic benchmark" to the mix as well!
