Agentic coding improves ARC AGI 2 performance across models

pivotools.github.io

1 point by steinsgate 11 days ago · 1 comment

steinsgate (OP) 11 days ago

We found something surprising about ARC AGI 2, the benchmark that aims to measure human-like fluid intelligence: just enabling a stateful Python tool boosts performance across models. We saw a >4x improvement with GPT OSS 120B (high), and the effect continues well into frontier territory (GPT 5.2) with double-digit gains.
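
By "stateful Python tool" we mean one where variables persist across tool calls, so the model can build up grid-manipulation helpers incrementally. Here's a minimal sketch of that mechanism (the model-call plumbing is omitted, and this isn't our exact harness):

    import io
    import contextlib

    class StatefulPythonTool:
        def __init__(self):
            # One namespace reused across calls: this is the "stateful" part.
            self.namespace = {}

        def run(self, code: str) -> str:
            # Execute model-emitted code, capturing stdout as the tool result.
            buf = io.StringIO()
            try:
                with contextlib.redirect_stdout(buf):
                    exec(code, self.namespace)
            except Exception as e:
                return f"{type(e).__name__}: {e}"
            return buf.getvalue()

    tool = StatefulPythonTool()
    tool.run("grid = [[0, 1], [1, 0]]")              # defines state, no output
    print(tool.run("print(sum(map(sum, grid)))"))    # state persists: prints 2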

We aren't sure whether these gains come from code execution being a stronger form of verification than pure chain-of-thought, or from it encouraging qualitatively different thinking patterns.

Another interesting finding: interleaved thinking, the model capability behind these gains, seems fragile at the infra/client layer. Soft failures can make capable models look much worse than they actually are.
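
As one hypothetical example of a soft failure: a client that silently drops reasoning items when replaying conversation history between tool calls never raises an error, but the model loses its chain of thought mid-task. A sketch of the kind of invariant check we mean (field names illustrative, not any specific API):

    def reasoning_preserved(sent, replayed):
        """True iff every reasoning item survived the client round trip."""
        keep = lambda msgs: [m for m in msgs if m.get("type") == "reasoning"]
        return keep(sent) == keep(replayed)

    history = [
        {"type": "reasoning", "content": "Try reflecting the grid vertically..."},
        {"type": "tool_call", "code": "print(grid[::-1])"},
    ]
    lossy = [m for m in history if m["type"] != "reasoning"]  # a buggy serializer
    assert not reasoning_preserved(history, lossy)  # no error surfaced, but CoT is gone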
