Your Workflow is the Eval

Tests tell you whether code passes assertions. But not everything you care about fits in an assertion. It's hard to test whether the change matches what was asked, whether the approach fits the codebase, or whether it bundles in restructuring nobody wanted.

We see this clearly in our own workflow. We run multiple agents on the same spec, run tests in each workspace, review the candidate diffs, and apply one selected patch.

In our data, passing tests is associated with a 1.8x lift in selection rate, but being top-reviewed is associated with a 9.9x lift.

Tests help, but they are a weak proxy for the code we tend to accept.

If you want to measure what tests miss, use your own workflow as the eval: gather real outcomes, understand what drove acceptance, and use that to make better future decisions.

What We See In Practice

We run this workflow in Voratiq: write a spec, run multiple agents against it, run the repo's tests and checks in each workspace, review the diffs, and apply the patch we select.

That gives us three layers of signal on the same work (sketched as a record in the code below):

  • Test: whether the candidate passed the available checks
  • Review: how strong the candidate looked in comparative review
  • Selected: whether the candidate's patch was the one accepted
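
Concretely, each candidate can be reduced to one record that carries all three signals. Here is a minimal sketch in Python; the field names are illustrative, not Voratiq's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    run_id: str         # one spec gives one run with several candidates
    agent: str          # which agent produced this diff
    passed_tests: bool  # Test: cleared the repo's checks in its workspace
    top_reviewed: bool  # Review: ranked first in comparative review
    selected: bool      # Selected: this patch was the one applied

# One run, two candidates: both pass tests, review breaks the tie.
candidates = [
    Candidate("run-001", "agent-a", passed_tests=True, top_reviewed=False, selected=False),
    Candidate("run-001", "agent-b", passed_tests=True, top_reviewed=True, selected=True),
]
```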

For this analysis, we looked at 399 runs and 4,784 candidate implementations.

The data covers about three months of day-to-day engineering work: mostly medium-to-high-difficulty feature, refactor, and bug-fix tasks in JS/TS, Python, and Swift, across CLI, UI, data, and runtime code.

The chart below shows how closely two different signals track the selected patch: whether a candidate passed tests, and whether it was top-reviewed.

Chart: Selected patch rate by test and review signal

Across this data, candidates that passed tests were selected 11.5% of the time, versus 6.5% for candidates that did not, a 1.8x lift. Top-reviewed candidates were selected 49.0% of the time, versus 5.0% for candidates that were not top-reviewed, a 9.9x lift.
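
The lift figures are just ratios of selection rates. As a minimal sketch, assuming candidate records shaped like the earlier example, the computation looks like this:

```python
def selection_rate(group) -> float:
    """Fraction of candidates in the group whose patch was applied."""
    return sum(c.selected for c in group) / len(group)

def lift(candidates, signal: str) -> float:
    """Selection rate among candidates with the signal set,
    divided by the rate among those without it."""
    with_signal = [c for c in candidates if getattr(c, signal)]
    without_signal = [c for c in candidates if not getattr(c, signal)]
    return selection_rate(with_signal) / selection_rate(without_signal)

# With the aggregate rates above: 0.115 / 0.065 ≈ 1.8 for passed_tests,
# and 0.490 / 0.050 = 9.8 for top_reviewed; the reported 9.9x comes from
# the unrounded rates.
```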

Test-based signals move the odds a bit, but they miss many of the qualities that drive acceptance. Review tracks selection more closely because it weighs those same qualities directly.

When a candidate passed tests but was not selected, the pattern was usually one of three things: it did not really solve the task that was asked, it took an approach that fit the codebase worse and was harder to maintain, or it bundled in extra restructuring the task did not ask for.

A few examples:

  • In one cleanup task, the losing patches reorganized nearby scripts and startup logic beyond the requested change.
  • In one refactor, the winning patch separated the data flow more clearly and kept the change smaller.
  • In one UI task, the losing patch fixed the issue indirectly with overrides, while the winner fixed the source of the problem directly.

Scope adherence, approach, and codebase fit are hard to encode in tests, but review often surfaces those judgments. That's what separates a passing patch from a selected one.

What This Means For Coding Evals

Tests answer a narrower question than review does. They tell you whether a patch cleared the available assertions. They do not tell you whether it was the right change.

If you rank or route agents based mainly on test evals, you can end up using agents that look strong on paper but produce worse outcomes in your workflow.

Public benchmarks still have a place. They are fast, reproducible, and good for directional comparisons.

If you want signal that is closer to real engineering decisions, measure outcomes inside the workflow itself (a minimal logging sketch follows this list):

  • Real tasks from your repo
  • The repo's own checks and constraints
  • Review outcomes
  • The patch that was selected
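
A low-ceremony way to start collecting this, assuming nothing about your stack: append one JSON line per candidate as runs finish, then aggregate by agent, model, or domain later. The file path and field names here are illustrative:

```python
import json
from pathlib import Path

LOG_PATH = Path("eval_outcomes.jsonl")  # illustrative location

def log_candidate_outcome(run_id: str, agent: str, task_kind: str,
                          passed_checks: bool, top_reviewed: bool,
                          selected: bool) -> None:
    """Append one candidate's outcome so later analysis can slice
    by agent, model, or domain."""
    record = {
        "run_id": run_id,
        "agent": agent,
        "task_kind": task_kind,          # e.g. "feature", "refactor", "bug-fix"
        "passed_checks": passed_checks,  # the repo's own tests and constraints
        "top_reviewed": top_reviewed,    # comparative review outcome
        "selected": selected,            # whether this patch was applied
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")
```

Once those records exist, the same lift computation from earlier runs over your own outcomes instead of a public benchmark.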

That is slower and harder to collect. Review is expensive, and it takes real infrastructure work to capture those signals reliably.

But the benefit is better information about which workflows, models, and domains clear your team's quality bar. That gives you a better basis for future decisions.