Most coding evals tell you whether a patch can satisfy a test suite. That is useful, but it is a narrower question than the one engineering teams usually care about. For most teams, the real question is which changes they would keep.
Those are different bars. A patch can pass tests and still miss the intent of the task, take the wrong approach, or create cleanup work the team does not want.
That gap shows up pretty clearly in our own development workflow. We run multiple agents on the same spec, run tests in each workspace, review the candidate diffs, and apply one selected patch.
Passing tests is associated with a 1.8x lift in selection rate. Being the top-reviewed candidate is associated with a 9.9x lift. Tests help, but they are a weak proxy for the code we tend to accept.
As coding agents take on more engineering work, that distinction matters more.
The strongest way to answer the question of which changes you would keep is to use your own workflow as the eval: gather real outcomes, understand what drove acceptance, and use that to make better future decisions.
What We See In Practice
We run this workflow in Voratiq: write a spec, run multiple agents against it, run the repo's tests and checks in each workspace, review the diffs, and apply the selected patch.
That gives us three layers of signal on the same work:
- test: whether the candidate passed the available checks
- review: how strong the candidate looked in comparative review
- selected: whether the candidate was the patch that got accepted
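One way to picture this is as a per-candidate record carrying all three layers. The sketch below is a hypothetical shape for that record; the field names are illustrative, not Voratiq's actual data model:

```python
from dataclasses import dataclass

# Hypothetical record for one candidate implementation.
# Field names are illustrative, not Voratiq's actual schema.
@dataclass(frozen=True)
class Candidate:
    run_id: str          # which spec/run produced this candidate
    agent: str           # which agent generated the patch
    passed_tests: bool   # test layer: cleared the repo's checks
    top_reviewed: bool   # review layer: ranked first in comparative review
    selected: bool       # selection layer: this patch was applied

c = Candidate("run-042", "agent-a",
              passed_tests=True, top_reviewed=False, selected=False)
```

Keeping all three outcomes on the same record is what makes the later comparisons possible: you can ask how often each signal co-occurs with selection.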
For this analysis, we looked at 399 runs and 4,784 candidate implementations.
The data covers about three months of day-to-day engineering work, mostly medium-to-high difficulty JS/TS, Python, and Swift feature, refactor, and bug-fix tasks across CLI, UI, data, and runtime code.
The chart below shows how closely two signals track the selected patch: whether a candidate passed tests, and whether it was top-reviewed.
Across this data, candidates that passed tests were selected 11.5% of the time, versus 6.5% for candidates that did not, a 1.8x lift. Top-reviewed candidates were selected 49.0% of the time, versus 5.0% for candidates that were not top-reviewed, a 9.9x lift.
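As a sanity check on the arithmetic, each lift is just the ratio of the two selection rates. The snippet below uses the rounded rates from the text, so the review lift comes out at 9.8x; the reported 9.9x presumably reflects the unrounded rates:

```python
def lift(rate_with: float, rate_without: float) -> float:
    """Ratio of selection rates with vs. without a signal."""
    return rate_with / rate_without

# Rounded rates from the text.
test_lift = lift(0.115, 0.065)    # ~1.77, reported as 1.8x
review_lift = lift(0.490, 0.050)  # 9.8x; reported 9.9x uses unrounded rates

print(f"test lift: {test_lift:.1f}x, review lift: {review_lift:.1f}x")
```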
Test-based signals move the odds somewhat, but they miss many of the qualities that drive acceptance. Review tracks selection far more closely because it weighs the same judgments a reviewer applies when deciding which patch to keep.
When a candidate passed tests but was not selected, the pattern was usually one of three things: it did not actually solve the task as written, it took an approach that fit the codebase worse and was harder to maintain, or it bundled in extra restructuring the task did not ask for.
A few examples:
- In one cleanup task, the losing patches reorganized nearby scripts and startup logic beyond the requested change.
- In one refactor, the winning patch separated the data flow more clearly and kept the change smaller.
- In one UI task, the losing patch fixed the issue indirectly with overrides, while the winner fixed the source of the problem directly.
Scope adherence, approach, and codebase fit are hard to encode in tests. Review often surfaces those judgments. In practice, that is what separates a passing patch from a selected one.
What This Means For Coding Evals
Tests answer a narrower question than review does. They tell you whether a patch cleared the available assertions. They do not tell you whether it was the right change.
If you rank or route agents based mainly on test evals, you can end up using agents that look strong on paper but produce worse outcomes in your workflow.
Public benchmarks are still useful. They are fast, reproducible, and good for directional comparisons.
If you want signal that is closer to real engineering decisions, measure outcomes inside the workflow itself:
- real tasks from your repo
- the repo's actual checks and constraints
- review outcomes
- the patch that was selected
That is slower and harder to collect. Review is expensive, and it can take meaningful infrastructure and analysis work to capture those signals reliably.
But the payoff is higher-fidelity information about which workflows, models, and domains clear your team's quality bar. That gives you a better basis for future decisions.
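Once those per-candidate outcomes are collected, ranking agents by in-workflow acceptance is a few lines of aggregation. The records below are hypothetical, just to show the shape of the computation:

```python
from collections import defaultdict

# Hypothetical per-candidate outcomes gathered from the workflow.
records = [
    {"agent": "agent-a", "selected": True},
    {"agent": "agent-a", "selected": False},
    {"agent": "agent-b", "selected": False},
    {"agent": "agent-b", "selected": False},
    {"agent": "agent-a", "selected": True},
]

tally = defaultdict(lambda: [0, 0])  # agent -> [selected count, total]
for r in records:
    tally[r["agent"]][0] += r["selected"]
    tally[r["agent"]][1] += 1

# Rank agents by the rate at which their patches were actually kept.
ranking = sorted(
    ((agent, sel / tot) for agent, (sel, tot) in tally.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(ranking)
```

The same aggregation extends naturally to per-model or per-domain breakdowns by swapping the grouping key.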