Agents Are Better Testers Than We Are

Adam Prout

Most of the conversation around agents and software engineering is about techniques for getting them to write higher quality code. I’m interested in a topic that is getting less attention amid all the excitement: agents are remarkably good testers and bug fixers. They can take the results of a test run, deduplicate failures, analyze logs or crash dumps, identify root causes, and propose fixes. In many cases, they can do all of this with very little human intervention. They’re also getting better and better at probing software for bugs directly, without the need for traditional test cases.

For database development, or any kind of systems software, this is a big deal. I’ve spent most of my career working on database engines (SQL Server, MemSQL/SingleStore, HorizonDB/Postgres), where the bar for correctness is very high: data loss and wrong results are unacceptable. That means a huge amount of engineering time goes into testing and test infrastructure. At MemSQL, over the course of a decade, we built roughly half a million lines of Python test code and over a million lines of SQL test queries with expected outputs, all handwritten. Running the full test suite required 500 machines and took 7–8 hours every night. Many of those tests were heavily randomized, injecting failures into IO, networking, and memory allocations to flush out subtle concurrency bugs and feature interactions. Agents in 2026 are very good at doing a lot of this work for us. I wish they had been around decades ago. I would have spent more time building features, less time chasing bugs, and likely shipped higher-quality systems faster.

Testing is not the only ingredient of software quality. Code reviews, design reviews, release processes, and team culture and experience all matter. But testing is the most repeatable of these. You can’t scale deep system expertise: not every change will be written or reviewed by someone with decades of experience in the codebase. What you can scale is a strong regression suite. Good tests let junior engineers safely modify complex systems, catching the kinds of correctness issues that would otherwise require years of intuition to avoid.

Writing Tests and Root Causing Bugs

I assume folks are aware by now that agents are very good at writing tests, so I’m not going to spend much time on it. In my experience, agents do their best work when given a reasonably precise idea of the key test scenario, e.g., “write me a test case that hard-kills Postgres while random write workloads are running, exercising the unique WAL write and page write code paths. Ensure failover happens promptly and that Postgres can properly restart and rejoin the cluster.”
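
To make that concrete, here is roughly the shape of test I’d expect an agent to produce from a prompt like that. It is only a sketch: the `cluster` fixture and its helpers are hypothetical stand-ins for whatever harness you already have, and psycopg is just one choice of driver.

```python
import random
import signal
import threading
import psycopg  # assumes the psycopg 3 driver; any Postgres client would do

# Hypothetical harness: `cluster` is assumed to expose the primary node, a way
# to hard-kill it, and helpers to wait for failover and for a node to rejoin.
def _random_writes(dsn, stop):
    # Upserts exercise unique-constraint WAL records and page writes.
    with psycopg.connect(dsn, autocommit=True) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS kv (k bigint PRIMARY KEY, v text)")
        while not stop.is_set():
            try:
                conn.execute(
                    "INSERT INTO kv VALUES (%s, %s) "
                    "ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v",
                    (random.randrange(1_000), "x" * random.randrange(1, 512)))
            except psycopg.OperationalError:
                return  # expected once the primary is hard-killed

def test_hard_kill_during_random_writes(cluster):
    old_primary = cluster.primary
    stop = threading.Event()
    writer = threading.Thread(target=_random_writes, args=(old_primary.dsn, stop))
    writer.start()

    old_primary.kill(signal.SIGKILL)         # hard kill, no clean shutdown
    cluster.wait_for_promotion(timeout=30)   # failover should be prompt
    stop.set()
    writer.join()

    # The new primary must serve reads; the killed node must recover and rejoin.
    with psycopg.connect(cluster.primary.dsn) as conn:
        conn.execute("SELECT count(*) FROM kv").fetchone()
    old_primary.restart()
    cluster.wait_for_rejoin(old_primary, timeout=60)
```
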
Once you have a large set of tests, similar to the MemSQL suite I mentioned above, those tests will generate a constant stream of failures. Every pull request and nightly run produces logs, crash dumps, and test artifacts. At this point, the bottleneck isn’t finding bugs; it’s root-causing and fixing them. Agents can take over that work. Given access to logs, source code, and test infrastructure, they can deduplicate failures, correlate them with recent changes, run tests locally, identify root causes, and propose fixes. This turns hours of manual debugging into minutes of reviewing the agent’s RCA. At scale, most failures aren’t new product bugs; they’re flaky tests, bad assumptions, or duplicate issues. Agents handle this well: grouping failures, filtering noise, and producing structured explanations and fix proposals.
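
As a rough sketch of the first triage step, here is the kind of deduplication pass you would run before handing failures to an agent (the regexes and field names are assumptions; adapt them to your own log format):

```python
import re
from collections import defaultdict

# Collapse failures that differ only in addresses, timestamps, or line numbers,
# so the agent (or a human) reviews one representative per signature rather
# than hundreds of raw logs.
def signature(traceback: str) -> str:
    sig = re.sub(r"0x[0-9a-fA-F]+", "0xADDR", traceback)              # pointer values
    sig = re.sub(r"\d{4}-\d{2}-\d{2}[ T][\d:.]+", "TIMESTAMP", sig)   # timestamps
    sig = re.sub(r":\d+", ":LINE", sig)                               # file:line numbers
    return sig

def dedupe(failures: list[dict]) -> dict[str, list[dict]]:
    groups = defaultdict(list)
    for f in failures:
        groups[signature(f["traceback"])].append(f)
    return groups
```

Each group then becomes one work item: a representative traceback, the tests that hit it, and the commits touching the implicated files since the last green run, which is the context an agent needs to propose a root cause.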

Once you have agents fixing bugs, the loop closes naturally. The agent diagnoses a failure, proposes a fix, opens a PR, and reruns the tests. CI stops being just validation and becomes a repair system. This kind of agentic workflow, acting directly on the repo and pipeline, is already emerging in modern CI/CD systems and hopefully will be a standard feature soon.
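
A minimal sketch of that last mile, assuming the agent has already produced a patch file and an RCA summary; the `git` and `gh` invocations are the standard CLIs, and everything upstream of `patch_path` and `rca_summary` is whatever agent tooling you use:

```python
import subprocess

def run(*cmd):
    subprocess.run(cmd, check=True)

# Turn an agent-proposed patch plus its RCA into a pull request. CI then
# reruns the suite against the branch, closing the loop.
def open_fix_pr(failure_id: str, patch_path: str, rca_summary: str):
    branch = f"agent-fix/{failure_id}"
    run("git", "checkout", "-b", branch)
    run("git", "apply", patch_path)                 # apply the agent's proposed diff
    run("git", "commit", "-am", f"Fix {failure_id} (agent-proposed)")
    run("git", "push", "-u", "origin", branch)
    run("gh", "pr", "create",
        "--title", f"Agent fix for {failure_id}",
        "--body", rca_summary)                      # RCA goes in the PR description
```

CI then runs the full suite against the opened PR, and a human reviews the RCA before merging.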

Agents as Manual Testers

Agents aren’t limited to generating tests; they’re also effective at directly probing systems for bugs.

  • Security vulnerabilities and fuzzing
    Agents can actively explore a code base looking for vulnerabilities, similar to fuzzers but with more context. Given a target surface area, they can generate adversarial inputs, vary parameters, and reason about failure modes as they go. This has started getting more attention recently, with models like Mythos reportedly showing improved skill in finding vulnerabilities.
  • Workload generation for performance testing
    Performance testing of SQL databases has historically relied on static benchmarks (TPC-C, TPC-H, etc.), many of which are a few decades old. Agents can take a high-level description of a workload, e.g., “a social network”, and generate a representative schema and query mix (a sketch of replaying such a workload appears after this list). These more generic workloads can then be reused to detect performance regressions across releases. This approach allows for a greater breadth of performance testing that more closely matches the expected customer use cases of the system.
  • Advanced randomized exploration
    Databases have long relied on random query generators to surface correctness bugs. A lot of research has been done on how to effectively generate random queries to find bugs. Agents can do this work too, but more effectively. Instead of blindly generating valid queries, they explore the system with intent and with more context on how the code works: combining features, mutating queries, and pushing edge cases. In practice, they behave like a much more capable random query generator, systematically probing the state space for logic and correctness errors.
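
To make the workload-generation idea above concrete, here is a rough sketch of a driver that replays an agent-generated workload spec and reports a tail latency you can compare across releases. The JSON shape and the psycopg driver are assumptions, not a prescription:

```python
import json
import random
import time
import psycopg  # assumes psycopg 3; substitute your driver of choice

# Replay an agent-generated workload spec against one build and return its p99
# latency. The spec is assumed to look like:
#   {"queries": [{"sql": "...%s...", "weight": 5,
#                 "params": [{"min": 1, "max": 100000}]}, ...]}
def replay(dsn: str, spec_path: str, seconds: int = 60) -> float:
    with open(spec_path) as f:
        spec = json.load(f)
    queries = spec["queries"]
    weights = [q["weight"] for q in queries]
    latencies = []
    with psycopg.connect(dsn, autocommit=True) as conn:
        deadline = time.time() + seconds
        while time.time() < deadline:
            q = random.choices(queries, weights=weights)[0]
            args = [random.randrange(p["min"], p["max"]) for p in q["params"]]
            start = time.perf_counter()
            conn.execute(q["sql"], args)
            latencies.append(time.perf_counter() - start)
    latencies.sort()
    return latencies[int(0.99 * len(latencies))]   # p99 latency for this build
```

Run the same spec against the previous release and the candidate build; a jump in the returned p99 is your regression signal.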

The key difference is that agents don’t just generate inputs; they adapt based on what they observe, making them far more effective at finding subtle bugs.
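
Here is a sketch of what that adaptive loop can look like for randomized exploration. `ask_agent` is a stand-in for your model call; the harness just executes whatever the agent proposes, feeds back what it observed, and records anything that looks like a server-side bug:

```python
import psycopg  # assumes psycopg 3; any driver with DBAPI exceptions works

# Agent-driven exploration harness. `ask_agent` is hypothetical; it should
# return a single SQL string given the schema and recent history.
def explore(dsn: str, schema_ddl: str, iterations: int = 200):
    history, findings = [], []
    with psycopg.connect(dsn, autocommit=True) as conn:
        conn.execute(schema_ddl)
        for _ in range(iterations):
            query = ask_agent(
                "Given this schema and the results so far, propose one SQL query "
                "likely to expose a planner or execution bug.",
                schema=schema_ddl, history=history[-20:])
            try:
                rows = conn.execute(query).fetchall()
                history.append({"query": query, "rows": len(rows)})
            except (psycopg.InternalError, psycopg.OperationalError) as e:
                # Internal errors and dropped connections are likely real bugs.
                findings.append({"query": query, "error": str(e)})
                break
            except psycopg.Error as e:
                # Ordinary errors (bad syntax, type mismatches) are just feedback.
                history.append({"query": query, "error": str(e)})
    return findings
```
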

Agents for Model Checking

Model checking with languages like P or TLA+ becomes much more accessible with agents. They’re reasonably good at taking a natural language specification, say a quorum commit protocol, and producing a working model in either language.

More broadly, agents make it easier for programmers to pick up unfamiliar languages. P and TLA+ have a steep learning curve. With agent assistance, that barrier to trying them out is much lower. You can go from an informal spec to an executable model much faster, and explore correctness properties without deep prior expertise.

Closing Thoughts

In my experience, agents deliver more impact in testing and bug fixing today than they do in writing production database engine/systems code. They can take a failing test, analyze logs and crash dumps, identify the root cause, and propose a fix with a clear RCA summary and relevant code, all with little to no human effort. Most fixes are small, but historically expensive to diagnose. That cost to diagnose largely disappears.

As of today, I spend a lot of time reading and understanding every line of production code agents produce in detail. This makes agents more like a faster keyboard for me. Useful, sometimes significantly so, but not fully autonomous. I validate every design decision and often redirect the approach taken. This takes focused effort.

Bug fixing is different. Agents can do most of the work unattended. What used to take hours of scouring through logs or crash dumps is now a few minutes of review. If I used to spend 20–25% of my time debugging, I expect that to drop below 5% consistently, and likely much further.

Regardless of where you land on agent-generated code quality, focusing agents on testing and bug fixing will dramatically improve software quality. Large systems always have more failures than humans can triage. Agents can keep up. Teams that don’t wire agents into CI to do the unglamorous work of finding and fixing bugs continuously are falling behind.

As a side note: Good test infrastructure and clean layering were always important, but they’re even more so now. If an end‑to‑end test takes 10 minutes to spin up dependencies before a single test case can run, you’re going to fall behind. Agents can generate and maintain tests quickly, but they can’t fix a system that’s fundamentally hard to test.