Codex gpt-5.4-xhigh Performance Tracker | Marginlab

2 min read Original article ↗

Last updated: Mar 18, 2026

We are collecting baseline data for Codex with gpt-5.4-xhigh on SWE tasks before degradation detection resumes.

  • Updated daily: Daily benchmarks on a curated subset of SWE-Bench-Pro
  • Collecting baseline: Degradation detection is paused while the new model baseline is established
  • What you see is what you get: We benchmark in Codex CLI with gpt-5.4-xhigh directly, no custom harnesses.

New model — collecting baseline data. Degradation detection paused. View historical performance.

Summary

Collecting...

new baseline

56 %

50 eval test cases ran

54 %

350 eval test cases ran

54 %

650 eval test cases ran

Daily Trend

Pass rate over time

Toggle 95% CI to view uncertainty around each point.

Pass Rate

Daily benchmark pass rate showing the percentage of tasks solved each day.

95% confidence interval for each data point. Toggle checkbox to show/hide. Wider intervals indicate more uncertainty (fewer samples).

Enable 95% CI checkbox to show confidence intervals

Weekly Trend

Aggregated 7-day pass rate

The same uncertainty toggle applies here for 7-day windows.

Pass Rate

7-day rolling pass rate aggregating daily results for a smoother trend view.

95% confidence interval for each data point. Toggle checkbox to show/hide. Wider intervals indicate more uncertainty (fewer samples).

Enable 95% CI checkbox to show confidence intervals

Get notified when degradation is detected

We'll email you when we detect a statistically significant performance drop.

Thanks for subscribing! Check your email to confirm.