METR's time-horizon of coding tasks does not mean what you think it means

killerstorm.github.io

1 points by killerstorm a month ago · 1 comment


tl;dr: If you calculate "the human time horizon using the same methodology as we do for models", it is only 1.5 hours at a 50% success rate for the baseline experts METR hired, and o3 surpassed it in April 2025, 6 months ahead of METR's prediction.

METR considers this "raw baseline" largely irrelevant because it might be affected by people getting bored, not being paid enough, etc. But they admit that this introduces a bias that makes the reported numbers less relevant for human-vs-AI comparison.
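
For readers unfamiliar with what a "time horizon @ 50% success rate" is, here is a rough sketch of the idea as I understand it: fit a logistic curve of success against the log of task length, then read off the task length where predicted success crosses 50%. This is an assumption about the general shape of the method, not METR's actual code. The function names are hypothetical and the data below is made up purely for illustration; it is not METR's baseline data.

```python
import math

def fit_logistic(log2_minutes, successes, lr=0.1, steps=20000):
    """Fit p(success) = sigmoid(a + b * x) by gradient ascent on the
    mean log-likelihood, where x is log2 of the task length in minutes."""
    a, b = 0.0, 0.0
    n = len(log2_minutes)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(log2_minutes, successes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def horizon_minutes(a, b, p=0.5):
    """Task length in minutes at which the fitted success probability equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)

# Hypothetical data: task lengths in minutes and whether the attempt succeeded.
# These values are invented for the example, not taken from METR.
task_minutes = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]
succeeded = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]

xs = [math.log2(m) for m in task_minutes]
a, b = fit_logistic(xs, succeeded)
h50 = horizon_minutes(a, b, p=0.5)
print(f"50% time horizon on the toy data: {h50:.0f} minutes")
```

The point of the post is that the same fit can be run on the human baseliners' own attempts, which is where the 1.5 hour figure comes from. Anything that lowers the measured human success rate on long tasks, such as boredom or pay, shifts that crossing point to the left.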
