Last updated: Mar 18, 2026
We are collecting baseline data for Codex with gpt-5.4-xhigh on SWE tasks before degradation detection resumes.
- Updated daily: Daily benchmarks on a curated subset of SWE-Bench-Pro
- Collecting baseline: Degradation detection is paused while the new model baseline is established
- What you see is what you get: We benchmark in Codex CLI with gpt-5.4-xhigh directly, no custom harnesses.
New model: collecting baseline data. Degradation detection is paused. Historical performance remains available.
Summary
Status: collecting new baseline
56% (50 eval test cases run)
54% (350 eval test cases run)
54% (650 eval test cases run)
Daily Trend
Pass rate over time
Toggle 95% CI to view uncertainty around each point.
Pass Rate
Daily benchmark pass rate showing the percentage of tasks solved each day.
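For readers who want a feel for how wide a 95% CI is at these sample sizes, here is a minimal sketch using the Wilson score interval. The page does not say which interval method the chart uses, so the choice of Wilson, the function name `wilson_interval`, and the 28-of-50 input (read off the 56% figure in the summary) are all assumptions made for illustration.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%).

    Hypothetical helper: the dashboard's actual CI method is not documented.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Assumed input: 28 passes out of 50 tasks, i.e. a 56% daily pass rate.
low, high = wilson_interval(28, 50)
print(f"95% CI: {low:.1%} to {high:.1%}")
```

With only 50 tasks per day the interval spans well over twenty percentage points, which is why single-day movements should be read cautiously.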
Weekly Trend
Aggregated 7-day pass rate
The same uncertainty toggle applies here for 7-day windows.
Pass Rate
7-day rolling pass rate aggregating daily results for a smoother trend view.
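One plausible way to build such a 7-day rolling rate is to pool passes and totals over the trailing window instead of averaging the daily percentages. The sketch below assumes that pooling approach; the page does not confirm it, and `rolling_pass_rate` and its input format are hypothetical.

```python
from collections import deque


def rolling_pass_rate(daily, window=7):
    """Pooled pass rate over a trailing window.

    daily: list of (passed, total) tuples, oldest day first.
    Returns one rate per day; None where the window holds no test cases.
    Hypothetical helper: the dashboard's real aggregation is not documented.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` days
    rates = []
    for passed, total in daily:
        recent.append((passed, total))
        pooled_passed = sum(p for p, _ in recent)
        pooled_total = sum(t for _, t in recent)
        rates.append(pooled_passed / pooled_total if pooled_total else None)
    return rates


# Illustrative data only: three days of 50 tasks each.
print(rolling_pass_rate([(28, 50), (26, 50), (27, 50)]))
```

Pooling weights every test case equally, so a day with fewer completed runs does not distort the trend the way a plain average of daily percentages would.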
Get notified when degradation is detected
We'll email you when we detect a statistically significant performance drop.
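As a rough illustration of what "statistically significant performance drop" can mean, the sketch below runs a one-sided two-proportion z-test comparing a baseline window against a recent one. The page does not describe its detection method, so the test choice, the function name `drop_p_value`, and the example counts are all assumptions.

```python
import math


def drop_p_value(base_passed, base_total, new_passed, new_total):
    """One-sided p-value for 'the new pass rate is lower than the baseline'.

    Uses a pooled two-proportion z-test. Hypothetical helper, not the
    dashboard's documented method.
    """
    p_base = base_passed / base_total
    p_new = new_passed / new_total
    pooled = (base_passed + new_passed) / (base_total + new_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_total + 1 / new_total))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a drop
    z = (p_base - p_new) / se
    # Upper-tail probability of the standard normal: P(Z >= z)
    return 0.5 * math.erfc(z / math.sqrt(2))


# Made-up counts: 351/650 baseline versus a hypothetical 15/50 day.
p = drop_p_value(351, 650, 15, 50)
print("significant drop" if p < 0.05 else "no significant drop")
```

A real monitor that checks daily would also need to account for repeated testing, which this sketch ignores.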