Codex gpt-5.5-xhigh Performance Tracker | Marginlab

Last updated: Jun 20, 2026

This tracker aims to detect statistically significant degradations in the performance of Codex with gpt-5.5-xhigh on software engineering (SWE) tasks.

  • Updated daily: Benchmarks run every day on a curated subset of SWE-Bench-Pro.
  • Detect degradation: Statistical tests flag pass-rate changes that fall outside the significance threshold (p < 0.05).
  • What you see is what you get: We benchmark directly in Codex CLI with gpt-5.5-xhigh, with no custom harnesses.
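The page does not say which statistical test the tracker runs. As one possible reading of "statistical testing for degradation detection", the sketch below applies a one-sided exact binomial test: given a baseline pass rate, it asks how likely it is to see this few passes or fewer by chance. The function names, the one-sided choice, and the example counts are illustrative assumptions, not the tracker's actual method.

```python
from math import comb


def lower_tail_p_value(passes: int, total: int, baseline: float) -> float:
    """Return P(X <= passes) for X ~ Binomial(total, baseline).

    A small value means the observed pass count is unusually low
    if the true pass rate were still equal to the baseline.
    """
    return sum(
        comb(total, k) * baseline**k * (1 - baseline) ** (total - k)
        for k in range(passes + 1)
    )


def is_degraded(passes: int, total: int, baseline: float, alpha: float = 0.05) -> bool:
    """Flag a degradation when the lower-tail p-value falls below alpha (assumed 0.05)."""
    return lower_tail_p_value(passes, total, baseline) < alpha


# Hypothetical daily runs of 50 tasks against the 56% baseline.
print(is_degraded(20, 50, 0.56))  # 20 of 50 passed
print(is_degraded(27, 50, 0.56))  # 27 of 50 passed
```

An exact test avoids the normal approximation, which matters little at 50 tasks per day but costs nothing to compute at this scale.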

Powered by MARGIN EVALS

Summary

51% pass rate, 50 eval test cases run

54% pass rate, 300 eval test cases run

54% pass rate, 1,200 eval test cases run

Daily Trend

Pass rate over time

Pass Rate

Daily benchmark pass rate showing the percentage of tasks solved each day.

Baseline

Historical average pass rate (56%) used as reference for detecting performance changes.

Threshold

Shaded region around baseline (±13.3%). Changes within this band are not statistically significant (p ≥ 0.05).

95% CI

95% confidence interval for each data point. Wider intervals indicate more uncertainty (fewer samples).
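To illustrate why intervals widen with fewer samples, here is a minimal sketch of a 95% confidence interval for a single pass rate using the Wilson score interval. The tracker does not state which interval it plots, so the choice of Wilson and the function name `wilson_interval` are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(passes: int, total: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion passes / total."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p_hat = passes / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Same observed rate (50%), different sample sizes (hypothetical counts).
small = wilson_interval(25, 50)
large = wilson_interval(150, 300)
print(f"n=50:  {small[0]:.3f} to {small[1]:.3f}")
print(f"n=300: {large[0]:.3f} to {large[1]:.3f}")
```

With the observed rate held fixed, the interval for 300 samples is markedly narrower than the one for 50.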

Dashed line at 56% baseline with ±13.3% significance threshold
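As a rough sanity check on the size of such a band, the sketch below computes the half-width of a two-sided 5% significance band around the baseline under a normal approximation to the binomial. It assumes the daily and 7-day points rest on 50 and 300 test cases respectively, taken from the summary counts. The result is about 13.8 and 5.6 percentage points, close to but not equal to the published ±13.3% and ±5.1%, so the tracker presumably uses a different procedure that the page does not describe.

```python
from math import sqrt
from statistics import NormalDist


def significance_half_width(baseline: float, n: int, alpha: float = 0.05) -> float:
    """Half-width of the band in which a pass rate from n tasks is not
    significantly different from the baseline (normal approximation)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * sqrt(baseline * (1 - baseline) / n)


daily = significance_half_width(0.56, 50)    # assumed daily sample size
weekly = significance_half_width(0.56, 300)  # assumed 7-day sample size
print(f"daily: ±{daily:.1%}, weekly: ±{weekly:.1%}")
# prints "daily: ±13.8%, weekly: ±5.6%"
```

The band shrinks with the square root of the sample size, which is why aggregating several days can reveal smaller shifts than a single day can.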

Weekly Trend

Aggregated 7-day pass rate

Pass Rate

7-day rolling pass rate aggregating daily results for a smoother trend view.
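A 7-day rolling pass rate could be computed by pooling pass counts over the trailing window, as in the sketch below. Whether the tracker pools raw counts or averages daily percentages is not stated; pooling is assumed here, and the function name and input shape are hypothetical.

```python
from collections import deque


def rolling_pass_rate(daily_counts: list[tuple[int, int]], window: int = 7) -> list[float]:
    """Pooled pass rate over a trailing window of days.

    daily_counts holds (passes, total) per day, oldest first, with total > 0.
    The first window - 1 entries use however many days are available so far.
    """
    recent: deque[tuple[int, int]] = deque(maxlen=window)  # drops the oldest day automatically
    rates = []
    for passes, total in daily_counts:
        recent.append((passes, total))
        pooled_passes = sum(p for p, _ in recent)
        pooled_total = sum(t for _, t in recent)
        rates.append(pooled_passes / pooled_total)
    return rates


# Hypothetical results for three days, using a 2-day window for brevity.
print(rolling_pass_rate([(1, 2), (2, 2), (0, 2)], window=2))
# prints [0.5, 0.75, 0.5]
```

Pooling counts weights each test case equally, so a day with fewer completed runs influences the window less than a full day does.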

Baseline

Historical average pass rate (56%) used as reference for detecting performance changes.

Threshold

Shaded region around baseline (±5.1%). Changes within this band are not statistically significant (p ≥ 0.05).

95% CI

95% confidence interval for each data point. Wider intervals indicate more uncertainty (fewer samples).

Dashed line at 56% baseline with ±5.1% significance threshold