Last updated: Apr 30, 2026
The goal of this tracker is to detect statistically significant degradations in the performance of Codex with gpt-5.5-xhigh on software engineering (SWE) tasks.
- Updated daily: benchmarks run every day on a curated subset of SWE-Bench-Pro.
- Detect degradation: statistical testing flags significant performance drops.
- What you see is what you get: we benchmark directly in Codex CLI with gpt-5.5-xhigh, with no custom harnesses.
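The page does not say which statistical test the tracker uses. As a minimal sketch of one common choice, the following assumes a one-sided two-proportion z-test comparing a recent window against a baseline. The function name `degradation_p_value` and the example counts are hypothetical, not taken from the tracker.

```python
import math


def degradation_p_value(base_pass, base_total, recent_pass, recent_total):
    """One-sided two-proportion z-test (assumed method, not the tracker's).

    Returns the probability of seeing a recent pass rate this low or lower
    if the true pass rate had not changed from the baseline.
    """
    p_base = base_pass / base_total
    p_recent = recent_pass / recent_total
    # Pooled pass rate under the null hypothesis of no change.
    pooled = (base_pass + recent_pass) / (base_total + recent_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / recent_total))
    if se == 0:
        return 1.0  # all-pass or all-fail in both samples: no evidence of a drop
    z = (p_recent - p_base) / se
    # Lower-tail probability of the standard normal, via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


# Illustrative counts only: 715/1300 passes at baseline vs 20/50 passes today.
p_value = degradation_p_value(715, 1300, 20, 50)
print(f"p-value: {p_value:.4f}")
```

A small p-value (for example, below 0.05) would suggest a real drop rather than day-to-day noise. With only 50 tasks per day, a single day needs a fairly large drop to reach significance, which is one reason to also look at longer windows.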
New model: collecting baseline data. Degradation detection is paused.
Summary
Collecting new baseline...
- 55% (50 eval test cases run)
- 54% (300 eval test cases run)
- 55% (1,300 eval test cases run)
Daily Trend
Pass rate over time
Toggle 95% CI to view uncertainty around each point.
Pass Rate: daily benchmark pass rate, showing the percentage of tasks solved each day. Enable the 95% CI checkbox to show confidence intervals.
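The page does not state how the 95% confidence intervals are computed. One reasonable option for a pass rate on a small daily sample is the Wilson score interval, sketched below. The function name `wilson_interval` and the example counts are assumptions made for illustration.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%).

    This is an assumed method; the tracker may use a different interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = passed / total
    denom = 1 + z**2 / total
    center = (p_hat + z**2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)) / denom
    )
    return center - half_width, center + half_width


# Illustrative counts only: 25 of 50 tasks solved in one day.
low, high = wilson_interval(25, 50)
print(f"95% CI: {low:.3f} to {high:.3f}")
```

With 50 tasks the interval spans roughly 13 percentage points on either side of the point estimate, so a single day's pass rate is only a coarse signal.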
Weekly Trend
Aggregated 7-day pass rate
The same uncertainty toggle applies here for 7-day windows.
Pass Rate: 7-day rolling pass rate, aggregating daily results for a smoother trend view. Enable the 95% CI checkbox to show confidence intervals.
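A 7-day rolling pass rate can be sketched as follows. This version assumes that daily results are pooled (total passes divided by total runs across the window) rather than averaged as daily percentages; the page does not say which the tracker does. The function name `rolling_pass_rate` and the sample data are hypothetical.

```python
from collections import deque


def rolling_pass_rate(daily_counts, window=7):
    """Pooled pass rate over a trailing window of days.

    daily_counts: iterable of (passed, total) tuples, oldest day first.
    Returns one rate per day; early days use however many days are available.
    """
    recent = deque()
    passed_sum = 0
    total_sum = 0
    rates = []
    for passed, total in daily_counts:
        recent.append((passed, total))
        passed_sum += passed
        total_sum += total
        if len(recent) > window:
            # Drop the oldest day once the window is full.
            old_passed, old_total = recent.popleft()
            passed_sum -= old_passed
            total_sum -= old_total
        rates.append(passed_sum / total_sum if total_sum else None)
    return rates


# Illustrative data only: (passed, total) for three days, using a 2-day window.
print(rolling_pass_rate([(1, 2), (2, 2), (0, 2)], window=2))
```

Pooling weights every test case equally, so a day with fewer completed runs contributes less to the window than a full day.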
Get notified when degradation is detected
We'll email you when we detect a statistically significant performance drop.