Benchmarks are contaminated.
Your team has no way to tell if AI is helping or hurting.
The Redacted Report
[Chart: release timeline, quality bars redacted. Oct 2025 Opus 4.5 · Nov 2025 CC 2.0.76 · Dec 2025 Codex CLI 0.98 · Jan 2026 Codex 5.2 · Feb 2026 CC 2.1.37 · Feb 2026 Opus 4.6 / Codex 5.3. Quality impact of each release: no data collected.]
[Floating quotes, the drift signals teams actually rely on: “is Claude broken today?” · “vibes are off” · “SWE-bench: 72.1%” · “works for me” · “rolled back to 2.0.62” · “we lost 2 days”]
OpenAI declared SWE-bench Verified dead — contamination across all frontier models. The primary benchmark is broken. — OpenAI, Feb 2026
“Outside of vanity metrics, I have nothing of value to show.” — Principal engineer, 900-person company (source)
Model and harness updates change AI behavior. Config changes compound. Quality drifts, and the only way people notice is when the “vibes are off”.
Codex 5.2-high beats Codex 5.2-xhigh 67% of the time on real coding tasks. More thinking tokens isn’t always better. — Voratiq, 175 runs
The Untested Stack
Layer                   Example                                  Changes
Workflow Mode           human-in-the-loop vs background agent    per task
Reasoning Settings      low / medium / high                      per task
Tool Settings           MCP servers, context                     on update
Skills                  SKILLS.md                                weekly
Custom Instructions     .cursorrules, AGENTS.md                  ad hoc
System Prompt / Rules   directory overrides                      weekly
Harness Version         codex-cli 0.98 → 0.104                   on update
Model Selection         opus-4.5 → opus-4.6                      on switch
Base Model Behavior     changes without notice                   monthly
3 models × 4 instruction sets × 2 tool configs × 2 workflow modes × 3 reasoning levels × 2 harness versions
288 configurations. You’re testing: 1
It’s not just the model. It’s the harness, the skills, the rules, the tools, the workflow. Every layer is a variable. None of them are tested together.
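The arithmetic is easy to verify. Here is a minimal sketch of that configuration space; the options listed per layer are illustrative placeholders, not a prescribed matrix:

```python
from itertools import product
from math import prod

# Illustrative options per layer; counts match the 3 x 4 x 2 x 2 x 3 x 2 above.
layers = {
    "model": ["opus-4.5", "opus-4.6", "codex-5.2"],
    "instructions": ["none", "AGENTS.md", ".cursorrules", "custom"],
    "tools": ["default", "with-mcp"],
    "workflow": ["human-in-the-loop", "background-agent"],
    "reasoning": ["low", "medium", "high"],
    "harness": ["codex-cli-0.98", "codex-cli-0.104"],
}

configs = list(product(*layers.values()))
assert len(configs) == prod(len(v) for v in layers.values()) == 288
print(f"{len(configs)} configurations to test; most teams test 1.")
```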
Median 3 tools per engineer; 14.7% use 5+. 49.1% use different tools for different tasks. — Pragmatic Engineer Survey, 2026
What passes for measurement
Green checks, red quality.
Tests
PR #847  ✓ PASS
PR #848  ✓ PASS
PR #849  ✓ PASS
PR #850  ✓ PASS
PR #851  ✓ PASS
PR #852  ✓ PASS
PR #853  ✓ PASS
PR #854  ✓ PASS
── The Gate ──
AI generates more PRs. Each one still needs human review. Tests / CI are the gate, not the source of truth.
“One of our main challenges has been code reviews, as the quantity of code produced goes up, and quality used to go down, pre-Opus 4.5.” — Staff engineer, 30-person company (source)
~50% of test-passing SWE-bench PRs would not be merged by repo maintainers. — METR, March 2026
zod #4843 — Fix branded-primitive typing in error tree
Real patches from Stet evaluation runs · zod dataset
You don’t ship untested code changes.
Stop shipping unmeasured agent changes.
The teams that get this right tell their agent to test itself. Every model swap, skill change, or config update gets measured before rollout — on their own code, their own tests, their own standards.
We ran two models on 60 tasks from a real open-source repo. Same tasks. Same tests. Here’s what the data showed.
GPT-5.3 Codex → GPT-5.4
codex cli (Mar 2026)
Pass rate              75% → 79% (+4%)
Review quality         19% → 32% (↑)
  correctness          1.6 → 1.9
  edge case handling   1.5 → 1.9
  maintainability      1.8 → 2.2
Cost/task              $3.06 → $0.67 (↓ 78%)
Verdict: PROMOTE
Promote · Hold · Rollback
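That promote/hold/rollback call can be mechanized once both runs are summarized. A minimal sketch of such a release gate, assuming hypothetical thresholds and field names (this is not Stet’s actual rubric):

```python
from dataclasses import dataclass

@dataclass
class EvalSummary:
    pass_rate: float       # fraction of tasks whose tests pass
    review_quality: float  # fraction of patches reviewers would merge as-is
    cost_per_task: float   # USD

def verdict(old: EvalSummary, new: EvalSummary) -> str:
    """Hypothetical gate: quality regressions roll back, clear wins promote,
    mixed signals hold for more runs. Cost never outweighs quality here."""
    if new.pass_rate < old.pass_rate - 0.05 or new.review_quality < old.review_quality:
        return "ROLLBACK"
    if new.pass_rate >= old.pass_rate and new.review_quality > old.review_quality:
        return "PROMOTE"
    return "HOLD"

# The GPT-5.3 Codex -> GPT-5.4 numbers from the card above:
old = EvalSummary(pass_rate=0.75, review_quality=0.19, cost_per_task=3.06)
new = EvalSummary(pass_rate=0.79, review_quality=0.32, cost_per_task=0.67)
print(verdict(old, new))  # PROMOTE
```

The asymmetry is deliberate: a quality regression triggers rollback even when cost improves.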
Drift
Quality degrades by default. Every update is a risk.
Hypothesis
288 → 1
Every change is an experiment. Test it on your code.
Above the Gate
Pass rate ≠ quality. The gap is where models diverge.
Stet replays your merged PRs, scores quality above pass/fail, and delivers comparison reports through your agent. Recurring runs and release gates follow.
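In spirit, that replay loop looks like the sketch below. Every helper here is a hypothetical stand-in under stated assumptions; none of it is Stet’s actual API:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Run:
    tests_pass: bool
    quality: float  # rubric grade above pass/fail, e.g. 0-3

def replay(task: str, config: str) -> Run:
    """Stub: re-run one merged-PR task under `config` (model, harness,
    rules, tools) and grade the patch against the repo's own tests and
    review standards. Returns a placeholder result here."""
    return Run(tests_pass=True, quality=2.0)

def compare(tasks: list[str], baseline: str, candidate: str) -> dict:
    """Replay the same tasks under both configs and summarize the gap."""
    report = {}
    for label, config in (("baseline", baseline), ("candidate", candidate)):
        runs = [replay(t, config) for t in tasks]
        report[label] = {
            "pass_rate": mean(r.tests_pass for r in runs),
            "mean_quality": mean(r.quality for r in runs),
        }
    return report
```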
We’re publishing our evaluation results openly.
See the data →