Evals are broken. Stet fixes them.


Benchmarks are contaminated.
Your team has no way to tell
if AI is helping — or hurting.

The Redacted Report

Oct 2025 · Opus 4.5 · ██████████████
Nov 2025 · CC 2.0.76 · █████████
Dec 2025 · Codex CLI 0.98 · ██████████████████
Jan 2026 · Codex 5.2 · ███████████
Feb 2026 · CC 2.1.37 · ████████████████
Feb 2026 · Opus 4.6 / Codex 5.3 · ████████████

no data collected

“is Claude broken today?” · “vibes are off” · “SWE-bench: 72.1%” · “works for me” · “rolled back to 2.0.62” · “we lost 2 days”

OpenAI declared SWE-bench Verified dead — contamination across all frontier models. The primary benchmark is broken. — OpenAI, Feb 2026

“Outside of vanity metrics, I have nothing of value to show.” — Principal engineer, 900-person company (source)

Model and harness updates change AI behavior. Config changes compound. Quality drifts, and the only way people notice is when the “vibes are off”.

Codex 5.2-high beats Codex 5.2-xhigh 67% of the time on real coding tasks. More thinking tokens isn’t always better. — Voratiq, 175 runs

The Untested Stack

Workflow Mode · human-in-the-loop vs background agent · per task
Reasoning Settings · low / medium / high · per task
Tool Settings · MCP servers, context · on update
Skills · SKILLS.md · weekly
Custom Instructions · .cursorrules, AGENTS.md · ad hoc
System Prompt / Rules · directory overrides · weekly
Harness Version · codex-cli 0.98 → 0.104 · on update
Model Selection · opus-4.5 → opus-4.6 · on switch
Base Model Behavior · changes without notice · monthly

3 models × 4 instruction sets × 2 tool configs × 2 workflow modes × 3 reasoning levels × 2 harness versions

288 configurations. You’re testing: 1

It’s not just the model. It’s the harness, skills, the rules, the tools, the workflow. Every layer is a variable. None of them are tested together.
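The configuration count is just the product of those layers. A quick sketch; the axis values below are illustrative stand-ins, not Stet's actual configuration schema:

```python
from itertools import product

# Illustrative axes for the untested stack (values are examples, not a real schema)
models = ["opus-4.5", "opus-4.6", "codex-5.2"]
instruction_sets = ["none", "team", "repo", "personal"]
tool_configs = ["default", "with-mcp"]
workflow_modes = ["interactive", "background"]
reasoning_levels = ["low", "medium", "high"]
harness_versions = ["0.98", "0.104"]

# Every combination is a distinct configuration your agent might run under
configs = list(product(models, instruction_sets, tool_configs,
                       workflow_modes, reasoning_levels, harness_versions))
print(len(configs))  # 288
```

Any one of those 288 tuples is what your team is actually running today; the rest are untested.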

Median 3 tools per engineer; 14.7% use 5+. 49.1% use different tools for different tasks. — Pragmatic Engineer Survey, 2026

What passes for measurement

Green checks, red quality.

Tests

PR #847 · PASS
PR #848 · PASS
PR #849 · PASS
PR #850 · PASS
PR #851 · PASS
PR #852 · PASS
PR #853 · PASS
PR #854 · PASS

── The Gate ──

AI generates more PRs. Each one still needs human review. Tests / CI are the gate, not the source of truth.

“One of our main challenges has been code reviews, as the quantity of code produced goes up, and quality used to go down, pre-Opus 4.5.” — Staff engineer, 30-person company (source)

~50% of test-passing SWE-bench PRs would not be merged by repo maintainers. — METR, March 2026

zod #4843 — Fix branded-primitive typing in error tree

Real patches from Stet evaluation runs · zod dataset

You don’t ship untested code changes.
Stop shipping unmeasured agent changes.

The teams that get this right tell their agent to test itself. Every model swap, skill change, or config update gets measured before rollout — on their own code, their own tests, their own standards.

We ran two models on 60 tasks from a real open-source repo. Same tasks. Same tests. Here’s what the data showed.

GPT-5.3 Codex → GPT-5.4

codex cli (Mar 2026)

Pass rate: 75% → 79% (+4%)

Review quality: 19% → 32%

correctness: 1.6 → 1.9
edge case handling: 1.5 → 1.9
maintainability: 1.8 → 2.2

Cost/task: $3.06 → $0.67 (↓ 78%)

Verdict: PROMOTE

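A promote/hold/rollback verdict like the one above can be sketched as a comparison over per-task results. Everything here is illustrative: the record fields, rubric scale, and regression thresholds are assumptions, not Stet's actual scoring logic.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class TaskResult:
    passed: bool      # did the repo's own tests pass?
    quality: float    # review-quality rubric score above pass/fail (hypothetical 0-3 scale)
    cost_usd: float

def verdict(baseline: list[TaskResult], candidate: list[TaskResult]) -> str:
    """Compare two configs on the same tasks; thresholds are illustrative."""
    def pass_rate(rs):
        return sum(r.passed for r in rs) / len(rs)

    d_pass = pass_rate(candidate) - pass_rate(baseline)
    d_quality = mean(r.quality for r in candidate) - mean(r.quality for r in baseline)

    if d_pass >= 0 and d_quality > 0:
        return "PROMOTE"
    if d_pass < -0.05 or d_quality < -0.2:   # hypothetical regression limits
        return "ROLLBACK"
    return "HOLD"

# Toy example: candidate passes more tasks and scores higher on review quality
baseline = [TaskResult(True, 1.6, 3.06)] * 3 + [TaskResult(False, 1.0, 3.06)]
candidate = [TaskResult(True, 1.9, 0.67)] * 4
print(verdict(baseline, candidate))  # PROMOTE
```

The point of the sketch is the shape of the decision: pass rate alone never promotes; a config has to move review quality too, and a regression on either axis blocks rollout.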

Drift

Quality degrades by default. Every update is a risk.

Hypothesis

288 → 1

Every change is an experiment. Test it on your code.

Above the Gate

Pass rate ≠ quality. The gap is where models diverge.

Run it on your codebase

Stet replays your merged PRs, scores quality above pass/fail, and delivers comparison reports through your agent. Recurring runs and release gates follow.

We’re publishing our evaluation results openly.

See the data →