Last updated: Jun 20, 2026
The goal of this tracker is to detect statistically significant degradations in the performance of Codex with gpt-5.5-xhigh on software engineering (SWE) tasks.
- Updated daily: Benchmarks run every day on a curated subset of SWE-Bench-Pro.
- Detect degradation: Statistical tests flag significant drops in pass rate.
- What you see is what you get: We benchmark directly in Codex CLI with gpt-5.5-xhigh, with no custom harnesses.
Summary
51% pass rate (50 eval test cases run)
54% pass rate (300 eval test cases run)
54% pass rate (1,200 eval test cases run)
Daily Trend
Pass rate over time
Toggle 95% CI to view uncertainty around each point.
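The page does not say how the 95% CI is computed. As a minimal sketch, assuming each day is a set of independent pass/fail tasks, a Wilson score interval is one common choice for a binomial proportion. The function name `wilson_interval` and the example counts are hypothetical, not taken from the tracker.

```python
import math

def wilson_interval(passed: int, total: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = passed / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Hypothetical day: 28 of 50 tasks solved (assumed numbers, for illustration only).
low, high = wilson_interval(28, 50)
print(f"pass rate {28 / 50:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

With only 50 tasks per day the interval is wide, which is why a single low day is weak evidence of degradation on its own.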
Pass Rate
Daily benchmark pass rate showing the percentage of tasks solved each day.
Baseline
Historical average pass rate (56%) used as reference for detecting performance changes.
Threshold
Shaded region around baseline (±13.3%). Pass rates within this band do not differ significantly from the baseline (p ≥ 0.05).
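The tracker does not state which statistical test produces the threshold band. One way to reason about it is a two-sided normal-approximation test of an observed pass rate against the baseline, sketched below. The helper names `threshold_band` and `is_significant` are hypothetical, and using the Summary counts (50 and 300) as sample sizes is an assumption. The half-widths this prints are of the same order as the published bands but are not guaranteed to reproduce them.

```python
import math

Z_95 = 1.959964  # two-sided critical value for p = 0.05

def threshold_band(baseline: float, n: int, z: float = Z_95) -> float:
    """Half-width of the band in which a pass rate is not significantly
    different from the baseline, under a normal approximation."""
    return z * math.sqrt(baseline * (1 - baseline) / n)

def is_significant(pass_rate: float, baseline: float, n: int) -> bool:
    """True when the observed pass rate falls outside the band (p < 0.05)."""
    return abs(pass_rate - baseline) > threshold_band(baseline, n)

baseline = 0.56  # historical average pass rate from the page
for n in (50, 300):  # assumed sample sizes, taken from the Summary counts
    print(f"n={n}: band = ±{threshold_band(baseline, n):.1%}")

# A 51% day on 50 tasks stays inside the band under this approximation.
print(is_significant(0.51, baseline, 50))
```

The band shrinks roughly as 1/sqrt(n), which is consistent with the weekly threshold being much narrower than the daily one.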
Weekly Trend
Aggregated 7-day pass rate
The same uncertainty toggle applies here for 7-day windows.
Pass Rate
7-day rolling pass rate aggregating daily results for a smoother trend view.
Baseline
Historical average pass rate (56%) used as reference for detecting performance changes.
Threshold
Shaded region around baseline (±5.1%). Pass rates within this band do not differ significantly from the baseline (p ≥ 0.05).
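How the 7-day rolling pass rate is aggregated is not spelled out on the page. A minimal sketch, assuming daily pass and total counts are pooled over the trailing window (rather than averaging the daily percentages), could look like this. The function name `rolling_pass_rate` and the daily counts are hypothetical.

```python
from collections import deque
from typing import Iterable, Iterator

def rolling_pass_rate(daily: Iterable[tuple[int, int]], window: int = 7) -> Iterator[float]:
    """Yield the pooled pass rate over the trailing `window` days.

    Each item is (passed, total) for one day. Counts are pooled, so a day
    with more tasks weighs more than a day with fewer.
    """
    recent = deque(maxlen=window)  # oldest day drops out automatically
    for passed, total in daily:
        recent.append((passed, total))
        pooled_passed = sum(p for p, _ in recent)
        pooled_total = sum(t for _, t in recent)
        yield pooled_passed / pooled_total if pooled_total else float("nan")

# Hypothetical daily results as (passed, total); not real tracker data.
days = [(28, 50), (25, 50), (30, 50), (27, 50), (29, 50), (26, 50), (24, 50), (31, 50)]
rates = list(rolling_pass_rate(days))
print([f"{r:.1%}" for r in rates])
```

Until seven days have accumulated, the early values are computed over a shorter window, so they carry more uncertainty than later points.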