Benchmarking GPT-5

9 points by aravindputrevu a year ago · 1 comment

We put GPT-5 through our Golden PR Dataset.

Here is the TL;DR:

- GPT-5 outperformed Opus-4, Sonnet-4, and OpenAI’s O3 across a battery of 300 error-diverse pull requests of varying difficulty.

- GPT-5 scored highest on our comprehensive test, finding 254 of 300 bugs (85%); the other models found between 200 and 207, which is 16% to 22% fewer.

- On the 25 hardest PRs in our evaluation dataset, GPT-5 achieved the highest-ever overall pass rate (77.3%), a 190% improvement over Sonnet-4, 132% over Opus-4, and 76% over O3.
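
For readers who want to sanity-check the arithmetic behind these bullets, here is a minimal Python sketch. It assumes that "fewer" means a shortfall relative to GPT-5's bug count and that "improvement" means a relative increase in pass rate; the post does not spell out either definition, and the function names are made up for illustration.

```python
def detection_rate(found: int, total: int) -> float:
    """Share of bugs found, as a percentage."""
    return 100.0 * found / total


def relative_shortfall(best: int, other: int) -> float:
    """How many fewer bugs `other` found, as a percentage of `best`.

    Assumption: the shortfall is measured relative to the best model's count.
    """
    return 100.0 * (best - other) / best


def implied_baseline(rate: float, improvement_pct: float) -> float:
    """Pass rate a competitor would need for `rate` to be the stated improvement.

    Assumption: "X% improvement" means rate = baseline * (1 + X / 100).
    """
    return rate / (1.0 + improvement_pct / 100.0)


if __name__ == "__main__":
    # Counts taken from the post: 254 of 300 bugs, versus 200 to 207.
    print(f"GPT-5 detection rate: {detection_rate(254, 300):.1f}%")
    for other in (200, 207):
        print(f"{other} bugs found: {relative_shortfall(254, other):.1f}% fewer")

    # Improvement figures from the post; the baselines are derived, not reported.
    for model, pct in (("Sonnet-4", 190), ("Opus-4", 132), ("O3", 76)):
        print(f"{model} implied pass rate: {implied_baseline(77.3, pct):.1f}%")
```

Under these assumptions, 254 of 300 works out to about 84.7%, which the post rounds to 85%. The implied baseline pass rates are back-calculated from the stated improvements, so treat them as rough estimates and not as reported results.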
