Agentic coding improves ARC AGI 2 performance across models

pivotools.github.io

1 point by steinsgate 11 days ago · 1 comment

steinsgate (OP) 11 days ago

We found something surprising about ARC AGI 2, the benchmark that aims to measure human-like fluid intelligence: just enabling a stateful Python tool boosts performance across models. We saw a >4x improvement with GPT OSS 120B (high), and the effect continues well into frontier territory (GPT 5.2) with double-digit gains.
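
By "stateful Python tool" we mean one where variables persist across tool calls, so the model can build up grid-manipulation helpers incrementally. Here's a minimal sketch of that mechanism (the model-call plumbing is omitted, and this isn't our exact harness):

    import io
    import contextlib

    class StatefulPythonTool:
        def __init__(self):
            # One namespace reused across calls: this is the "stateful" part.
            self.namespace = {}

        def run(self, code: str) -> str:
            # Execute model-emitted code, capturing stdout as the tool result.
            buf = io.StringIO()
            try:
                with contextlib.redirect_stdout(buf):
                    exec(code, self.namespace)
            except Exception as e:
                return f"{type(e).__name__}: {e}"
            return buf.getvalue()

    tool = StatefulPythonTool()
    tool.run("grid = [[0, 1], [1, 0]]")              # defines state, no output
    print(tool.run("print(sum(map(sum, grid)))"))    # state persists: prints 2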

We aren't sure whether these gains come from code execution being a stronger form of verification than pure chain-of-thought, or from it encouraging qualitatively different thinking patterns.

Another interesting finding: interleaved thinking, the model capability behind these gains, seems fragile at the infra/client layer. Soft failures can make capable models look much worse than they actually are.
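
As one hypothetical example of a soft failure: a client that silently drops reasoning items when replaying conversation history between tool calls never raises an error, but the model loses its chain of thought mid-task. A sketch of the kind of invariant check we mean (field names illustrative, not any specific API):

    def reasoning_preserved(sent, replayed):
        """True iff every reasoning item survived the client round trip."""
        keep = lambda msgs: [m for m in msgs if m.get("type") == "reasoning"]
        return keep(sent) == keep(replayed)

    history = [
        {"type": "reasoning", "content": "Try reflecting the grid vertically..."},
        {"type": "tool_call", "code": "print(grid[::-1])"},
    ]
    lossy = [m for m in history if m["type"] != "reasoning"]  # a buggy serializer
    assert not reasoning_preserved(history, lossy)  # no error surfaced, but CoT is gone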
