The AI Coding Era Makes Boring Tests More Valuable

Agents can now produce plausible diffs cheaply. A developer can describe a bug, hand it to an agent, and get a patch with tests in under a minute. Checking whether that patch does the right thing, fits the codebase, and will keep working as the project evolves is not cheap.

Generation now outpaces review capacity. When the agent writes both the implementation and the tests, a green test suite can feel like assurance that the code is correct when it is only a badge. Teams still have to decide whether generated tests are independent proof or just more code to review.

The diff indicates behavior

Without a test, the reviewer has to guess what the patch is meant to do. A good test states the intended behavior unambiguously: for this input, this must be true; this regression must stay fixed. Boring tests are valuable because they are easy to read and each one focuses on a single behavior.
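As a minimal sketch of what "boring" means here, the tests below each pin down one behavior of a small function. The function `normalize_email` and its rules are hypothetical and exist only for illustration.

```python
def normalize_email(raw: str) -> str:
    """Hypothetical function under test: trim whitespace and lowercase the address."""
    return raw.strip().lower()


def test_normalize_email_lowercases_the_address():
    # One input, one expected output, one behavior.
    assert normalize_email("Ada@Example.COM") == "ada@example.com"


def test_normalize_email_strips_surrounding_whitespace():
    # Regression guard: surrounding whitespace must not survive normalization.
    assert normalize_email("  ada@example.com\n") == "ada@example.com"
```

A reviewer can read each test name and assertion and know what the patch is supposed to guarantee without opening the implementation.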

METR's maintainer-review study finds that benchmark success can overstate what maintainers will merge. Four active maintainers across three real repositories manually reviewed 296 AI-generated PRs that had already passed an automated grader. The grader's pass rate was about 24.2 percentage points higher than the maintainers' merge rate, and about half of those test-passing PRs would not have been merged. Even with green tests, maintainers rejected patches for core functionality failures, breakage elsewhere, and code quality issues.

Passing tests reduce uncertainty, but they do not replace review. Reviewers still need to evaluate fit, scope, maintainability, and whether the tests express the right behavior. FrontierCode scores behavioral correctness, regression safety, test correctness, scope, and code quality separately.

Generated tests can inherit the same bug

The worst mistake is to treat an agent's tests as independent evidence. The tests can share the same misunderstanding as the code. The code passes because it implements that misunderstanding, and the tests pass because they assert it.

SWE-bench Verified exists because test quality determines what benchmark results mean. OpenAI worked with 93 Python developers to label 1,699 SWE-bench samples. The annotators flagged 61.1% for unit tests that might unfairly reject valid solutions and 38.3% for underspecified problem statements, and 68.3% of samples were filtered out overall. Test review has to confirm that assertions and fixtures match the expected behavior, including edge cases.

Start with assertions, fixtures, and edge cases. Ask whether the test would catch a plausible wrong implementation. A test that only checks that a value is not null, only matches a snapshot, or only repeats the happy path is weak evidence. FrontierCode also cites incomplete test coverage as a source of false positives.
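One way to make "would this test catch a plausible wrong implementation?" concrete is to run the same checks against a deliberately wrong version. The sketch below is illustrative only: the discount functions and both checks are hypothetical, not taken from any cited benchmark.

```python
def apply_discount_correct(price_cents: int, percent: int) -> int:
    """Reference behavior: take `percent` percent off the price."""
    return price_cents - (price_cents * percent) // 100


def apply_discount_wrong(price_cents: int, percent: int) -> int:
    """Plausible bug: treats the percentage as a flat amount in cents."""
    return price_cents - percent


def weak_check(fn) -> bool:
    # Only asserts that something came back; almost any implementation passes.
    return fn(1000, 10) is not None


def strong_check(fn) -> bool:
    # Pins exact values, including the 0% and 100% edge cases.
    return fn(1000, 10) == 900 and fn(1000, 0) == 1000 and fn(1000, 100) == 0


# The weak check cannot tell the two implementations apart; the strong one can.
results = {
    "weak/correct": weak_check(apply_discount_correct),
    "weak/wrong": weak_check(apply_discount_wrong),
    "strong/correct": strong_check(apply_discount_correct),
    "strong/wrong": strong_check(apply_discount_wrong),
}
```

Here `results` shows the weak check passing for both versions and the strong check passing only for the correct one, which is the property a reviewer is looking for.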

Read the test diff first. Once it makes the intended behavior clear, check that the code diff delivers that behavior before trusting it.

Make the test fail first

For bug fixes and behavioral changes, check whether the new test would have failed before the patch. A test that fails on the base commit and passes after the change exercises the behavior being fixed. A test that passes both before and after is probably testing something else.

SWE-bench builds this into its evaluation model. FAIL_TO_PASS tests fail before the solution PR and pass after it. PASS_TO_PASS tests pass both before and after, which confirms that unrelated behavior still works. Together they answer two questions: "did the fix solve the issue?" and "did the fix avoid regressions?"
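The before/after bookkeeping can be sketched as a small classifier over test results. This is not SWE-bench's harness: FAIL_TO_PASS and PASS_TO_PASS are the names used above, while PASS_TO_FAIL, FAIL_TO_FAIL, and the function names are assumptions added for illustration.

```python
from enum import Enum


class Transition(Enum):
    FAIL_TO_PASS = "FAIL_TO_PASS"  # failed on base, passes after the patch
    PASS_TO_PASS = "PASS_TO_PASS"  # passed on base, still passes
    PASS_TO_FAIL = "PASS_TO_FAIL"  # illustrative label: a regression
    FAIL_TO_FAIL = "FAIL_TO_FAIL"  # illustrative label: still broken


def classify(passed_before: bool, passed_after: bool) -> Transition:
    """Map one test's before/after outcome to a transition label."""
    if passed_after:
        return Transition.PASS_TO_PASS if passed_before else Transition.FAIL_TO_PASS
    return Transition.PASS_TO_FAIL if passed_before else Transition.FAIL_TO_FAIL


def summarize(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Group test names by transition; only tests present in both runs are compared."""
    groups: dict[str, list[str]] = {t.value: [] for t in Transition}
    for name in sorted(before.keys() & after.keys()):
        groups[classify(before[name], after[name]).value].append(name)
    return groups


summary = summarize(
    before={"test_bug_is_fixed": False, "test_unrelated": True, "test_other": True},
    after={"test_bug_is_fixed": True, "test_unrelated": True, "test_other": False},
)
```

In this example `test_bug_is_fixed` lands in FAIL_TO_PASS, `test_unrelated` in PASS_TO_PASS, and `test_other` is flagged as a regression.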

FrontierCode applies the same idea to agent-written tests: it runs the submitted tests on the base commit and requires that they fail. If they pass against the original broken code, they have not caught the bug. The check is cheap and deterministic.

The rule has exceptions. Characterization tests written before a refactor, tests for brand-new APIs, and some performance or concurrency tests can pass on the base commit for valid reasons. But a test that is supposed to prove a behavioral fix should usually fail before the fix. If it does not, find out why it already passes.
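The base-commit check described in this section can be sketched locally with `git worktree` and pytest. This is a rough sketch under stated assumptions, not FrontierCode's harness: it assumes a Python project tested with pytest, that the new tests are committed at `head_ref`, and that they need no extra setup. The function names, the injectable `run` parameter, and the result labels are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Sequence

Runner = Callable[[Sequence[str], Path], int]


def run_command(cmd: Sequence[str], cwd: Path) -> int:
    """Run a command in `cwd` and return its exit code."""
    return subprocess.run(list(cmd), cwd=cwd).returncode


def check_tests_on_base(
    repo: Path,
    base_ref: str,
    head_ref: str,
    test_paths: Sequence[str],
    run: Runner = run_command,
) -> str:
    """Run the new tests against the base commit and report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        worktree = Path(tmp) / "base"
        # Check out the base commit in a throwaway worktree.
        if run(["git", "worktree", "add", "--detach", str(worktree), base_ref], repo) != 0:
            return "inconclusive"
        try:
            # Copy only the new test files from the patched commit.
            if run(["git", "checkout", head_ref, "--", *test_paths], worktree) != 0:
                return "inconclusive"
            code = run([sys.executable, "-m", "pytest", *test_paths], worktree)
        finally:
            run(["git", "worktree", "remove", "--force", str(worktree)], repo)
    if code == 1:  # pytest: tests ran and at least one failed
        return "fails_on_base"
    if code == 0:  # pytest: everything passed, so the tests did not catch the bug
        return "passes_on_base"
    return "inconclusive"  # collection errors, no tests collected, usage errors, ...
```

A call such as `check_tests_on_base(Path("."), "main", "HEAD", ["tests/test_fix.py"])` (hypothetical refs and path) returning `"passes_on_base"` is the signal to investigate. An `"inconclusive"` result usually means the tests could not even run against the old code, which fits the exceptions listed above and needs a human look.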

Smaller patches are easier to review

SlopCodeBench measured what happens when agents extend their earlier solutions across iterative development checkpoints. No evaluated agent solved any problem end to end. The best strict checkpoint success rate was 14.8%. Structural erosion increased in 77% of the trajectories, and verbosity in 75.5%. Prompt-side interventions limited some early damage but did not prevent drift, and both tested prompts increased cost per checkpoint while strict performance declined.

Agents keep building on their earlier choices, so shortcuts accumulate and carry over into later tests. Late tests are more likely to mirror whatever structure emerged than to verify one definite behavior. Small tasks leave room for boring tests that check one behavior. Broad tasks turn tests into vague receipts for a large generated diff, which makes them harder to review.

FrontierCode treats scope as a mergeability criterion, considering file boundaries, diff size, and semantic locality. Scope determines whether a test can describe what changed clearly enough for a reviewer to verify it.
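A rough proxy for two of those signals, file boundaries and diff size, can be computed from the output of `git diff --numstat`. The sketch below is not FrontierCode's scoring: the `ScopeSummary` type and `summarize_numstat` function are hypothetical, and semantic locality is not something line counts can capture.

```python
from dataclasses import dataclass, field


@dataclass
class ScopeSummary:
    files: int = 0
    lines_changed: int = 0
    top_level_dirs: set[str] = field(default_factory=set)


def summarize_numstat(numstat: str) -> ScopeSummary:
    """Summarize `git diff --numstat` output (added<TAB>deleted<TAB>path per line)."""
    summary = ScopeSummary()
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        added, deleted, path = parts
        summary.files += 1
        if added != "-" and deleted != "-":  # binary files are reported as "-"
            summary.lines_changed += int(added) + int(deleted)
        # Files in the repository root are grouped under ".".
        summary.top_level_dirs.add(path.split("/", 1)[0] if "/" in path else ".")
    return summary


sample = "3\t1\tsrc/app.py\n10\t0\ttests/test_app.py\n-\t-\tdocs/logo.png\n"
scope = summarize_numstat(sample)
```

A patch that touches many top-level directories or thousands of lines for a one-line bug report is a prompt to ask whether the task should have been split, not an automatic rejection.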

Define intended behavior in plain language, then ask for code. Keep a patch to one behavior change. When possible, run new tests against the base commit. Review the test diff as part of the patch.

