How to scale agentic evaluation: Lessons from 200,000 SWE-bench runs


Agentic benchmarks can have hundreds of problems, each requiring dozens of steps. If you want statistical confidence, you need to run them multiple times. That adds up to tens of thousands of runs, but evaluation methods designed for short, linear executions don’t hold up once agent trajectories become long, branching, and stateful.

In the process of applying AI21 Maestro, our test-time compute agentic framework, to SWE-bench Verified, we ran over 200,000 evaluations. In this blog, we describe the evaluation architecture we built to address these challenges at scale, and share practical takeaways for developers, AgentOps, and MLOps teams evaluating agentic systems in production.

Why agentic evaluation is an infrastructure problem

Evaluating a standard LLM is already non-trivial, but agentic systems introduce layers of complexity that traditional one-shot evaluation pipelines aren’t architected to handle. Because agents like AI21 Maestro are stateful, active participants—traversing codebases, executing terminal commands, and iterating—the challenges are less about methodology and more about infrastructure bottlenecks.

To reach statistical confidence at scale, we had to solve for three specific technical constraints:

  • The throughput wall (Duration & Duplicity): Unlike a chatbot that responds in milliseconds, an agentic run is a high-latency, multi-step workflow. When you factor in Duplicity—the requirement to run the same benchmark instance across multiple random seeds to measure variance—you face a massive compute overhead. We weren’t just running a test; we were orchestrating thousands of concurrent, multi-minute stateful sessions.
  • State collision (The isolation requirement): Agents have the ability to write – they modify file systems and change environment states. To ensure evaluation integrity, every trajectory must be strictly isolated to prevent one agent’s hallucinated code change from poisoning the environment for the next. This creates a massive container orchestration challenge: how do you provision thousands of pristine sandboxes without the overhead killing your performance?
  • The reliability tax (Resumability): In a campaign of 200,000 runs, infrastructure failures (pod evictions, network blips, or runner timeouts) are a statistical certainty. In long-running evaluations, losing a 2-hour trajectory at the 99% mark is a massive waste of tokens and compute. We needed a decoupled architecture that separates Generation (patch creation) from Evaluation (test execution), allowing for granular resumability.

These aren’t just theoretical hurdles; they represent the primary technical friction that caused our initial local and naive Kubernetes implementations to fail under load.

Trial #1: The “local-first” assumption meets cloud reality

The official SWE-bench code was written to run locally. It assumes you have a Hugging Face cache directory to avoid re-downloading the benchmark, and it spins up Docker containers both for inference and for evaluation, assuming docker run just works.

In practice, meaningful evaluation doesn’t happen on a local machine beyond the initial debugging phase. While local runs are useful for quick iteration and setting breakpoints, evaluation at scale requires proper infrastructure. Ours is Kubernetes-based, orchestrated via Argo Workflows – an environment where we can reliably measure performance, quality, and the white-box metrics we care about, while benefiting from elasticity, fault tolerance, and horizontal scaling via the cluster’s HPA.

Unfortunately, none of the SWE-bench assumptions holds true in Kubernetes.

First, fresh pods don’t store cached data. Simply running the SWE-bench code as is resulted in each pod attempting to download the benchmark from HuggingFace on startup. At scale, we started getting hit with 429 rate limits – not because of any sophisticated attack, just because we were accidentally requesting the same files thousands of times.
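One generic way to avoid that (a sketch of a mitigation, not necessarily how the SWE-bench harness or our pipeline handles it) is to populate a shared cache once and point every pod at it before loading the dataset. The dataset ID below is the public SWE-bench Verified dataset on Hugging Face; the mount path is an assumption:

```python
import os
from datasets import load_dataset

# Assumption: /mnt/hf-cache is a shared, pre-populated volume mounted into every pod.
SHARED_CACHE = "/mnt/hf-cache"

# Point the Hugging Face tooling at the shared cache so fresh pods don't re-download.
os.environ.setdefault("HF_HOME", SHARED_CACHE)

# With a warm cache this resolves locally instead of hitting the Hub from
# thousands of pods at once, which is what triggered the 429s.
swebench = load_dataset(
    "princeton-nlp/SWE-bench_Verified",
    split="test",
    cache_dir=SHARED_CACHE,  # explicit, so it works even if HF_HOME was read earlier
)
print(f"Loaded {len(swebench)} instances from the shared cache")
```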

Second, it was clear that running the SWE-bench code as-is in our Kubernetes environment wouldn’t work. You can’t simply docker run inside a Kubernetes pod and expect it to work smoothly: container-in-container execution requires privileged access, adds overhead, and complicates resource management. We needed to modify how we ran the code to make it compatible with our test environment.

Trial #2: Adapt the SWE-bench code

As a second attempt, we tried to keep SWE-bench’s original Docker-based evaluation logic and make it work inside our Kubernetes environment. We adapted the code so that each evaluation run would provision a fresh Docker container from within the Kubernetes pod running the evaluation, download the repository, install dependencies, execute AI21 Maestro, evaluate the resulting patch, and then tear everything down when finished.

At the scale of SWE-bench this quickly adds up. With roughly 500 instances, running two variants and repeating each run multiple times to measure variance (e.g., a duplicity of 4, 8, or 16) meant provisioning and tearing down on the order of 16,000 Docker containers over a single evaluation window.
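The back-of-envelope math makes the overhead obvious (the duplicity value here is just one of the settings quoted above):

```python
# Containers provisioned and torn down over a single evaluation window.
instances = 500   # SWE-bench Verified instances (roughly)
variants = 2      # agent configurations under comparison
duplicity = 16    # repeated runs per instance to measure variance

runs = instances * variants * duplicity
print(f"{runs:,} runs -> {runs:,} provision/teardown cycles")  # 16,000
```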

It ran – but slowly and with significant failure rates caused by resource contention and external rate limits. 

We had to find a way to declutter the system and improve uptime. 

Trial #3: Sharing resources across runs

Our real ‘aha!’ moment came when we tallied what was actually unique per run, versus what could be shared across all runs for that instance:

(Table: resources unique to each run vs. resources shared across all runs for an instance)

What if we could provision what was shared between runs – and then allow many subsequent runs to use those resources? In other words, what if we could make the simulation environment multi-tenant?

So, we streamlined our evaluation stack to resemble this architecture:

(Figure: multi-tenant evaluation architecture)

To test this new multi-tenant simulation environment, we deployed approximately 500 pods in our Kubernetes cluster (one for each SWE-bench instance). Each pod contains the shared, per-instance resources:

  • The repository checked out at the correct commit
  • The MCP server ready to execute commands
  • Installed dependencies that persist across runs

These pods are provisioned just once (at the evaluation’s start) and can then serve dozens of AI21 Maestro runs targeting that same SWE-bench instance, whether those runs happen in sequence or in parallel. This may seem counterintuitive at first: multiple trajectories working on the same instance would obviously clash as the repository gets modified and files change. That’s why we extended the MCP protocol and developed a dedicated MCP client for isolated environments.
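The protocol extension itself is out of scope for this post, but as an illustration of the underlying idea, here is a minimal sketch of per-run isolation on top of a single shared checkout using git worktrees; the paths and helper names are hypothetical, not part of Maestro or the MCP extension:

```python
import subprocess
import uuid
from pathlib import Path

# Assumptions: /workspace/repo is the shared clone already checked out at the
# instance's base commit, and /workspace/runs holds one lightweight worktree per run.
SHARED_REPO = Path("/workspace/repo")
RUNS_ROOT = Path("/workspace/runs")


def create_isolated_workspace(base_commit: str) -> Path:
    """Give one trajectory its own working tree; the git object store stays shared."""
    run_dir = RUNS_ROOT / f"run-{uuid.uuid4().hex[:8]}"
    subprocess.run(
        ["git", "worktree", "add", "--detach", str(run_dir), base_commit],
        cwd=SHARED_REPO,
        check=True,
    )
    return run_dir  # every file edit the agent makes is confined to run_dir


def teardown_workspace(run_dir: Path) -> None:
    """Drop the worktree when the run finishes, leaving the shared clone pristine."""
    subprocess.run(
        ["git", "worktree", "remove", "--force", str(run_dir)],
        cwd=SHARED_REPO,
        check=True,
    )
```

Because the worktrees share the underlying object store, and the pod’s installed dependencies live outside the working tree, dozens of trajectories can target the same instance without re-cloning or re-installing anything.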

By streamlining repo downloads and pod provisioning, we saw a dramatic drop in our failure rate:

(Figure: failure rate after streamlining repo downloads and pod provisioning)

In this setup, wall-clock time is dependent on parallelism and variant complexity, not infrastructure overhead. Run times vary between variants:

  • A lightweight variant averages 3.5 minutes per example
  • A reasonable variant takes around 10 minutes
  • A thorough variant can exceed 2 hours per example

With our current architecture supporting up to 8K parallel runs, total wall time becomes the duration of the longest individual run. For a typical variant, that’s about 20 minutes for a full SWE-bench evaluation.

Pro tip: Building in resumability 

We designed our evaluation system with a clear separation between the generation step and the evaluation step:

  • Generation step: AI21 Maestro analyzes the issue, explores the codebase, and produces a patch
  • Evaluation step: Apply the patch, run the test suite, determine if the issue is resolved

If a run fails during evaluation—say, the test runner times out or a pod gets evicted—we don’t need to regenerate the patch. The generation artifacts are persisted. We can re-run just the evaluation step, saving all the time and tokens spent on generation.
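As a rough sketch of that decoupling (the storage layout and helper functions here are illustrative, not our actual pipeline code), the evaluation step only ever reads a persisted artifact, so it can be retried on its own:

```python
import json
from pathlib import Path

# Assumption: generation artifacts land in durable storage keyed by instance and
# seed, e.g. /artifacts/patches/<instance_id>/<seed>.json. The two helpers below
# are stand-ins for the real agent call and the real test harness.
PATCH_STORE = Path("/artifacts/patches")


def run_agent(instance_id: str, seed: int) -> str:
    raise NotImplementedError("stand-in for the expensive agent run")


def apply_patch_and_run_tests(instance_id: str, patch: str) -> bool:
    raise NotImplementedError("stand-in for the test harness")


def generation_step(instance_id: str, seed: int) -> Path:
    """Run the agent once and persist the patch before any evaluation starts."""
    patch = run_agent(instance_id, seed)  # expensive: tokens + wall clock
    out = PATCH_STORE / instance_id / f"{seed}.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"instance_id": instance_id, "patch": patch}))
    return out


def evaluation_step(artifact: Path) -> bool:
    """Cheap to retry: reload the stored patch and rerun the tests, no regeneration."""
    record = json.loads(artifact.read_text())
    return apply_patch_and_run_tests(record["instance_id"], record["patch"])
```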

This also enables working with partial data. If 80% of runs complete successfully, we can analyze those results while the remaining 20% are retried or investigated. We’re not blocked waiting for 100% completion to learn something useful.

Scaling multi-tenant simulation environments 

With the multi-tenant simulation environment in place and our metrics added, we then focused on scaling. Our current SLA supports 10,000 parallel runs across all evaluations; going forward, we expect to push that number at least 10X higher by adding multiple executors to the same pod.

Per-run provisioning, with all of its associated rate limits and resource contention bottlenecks, would never have allowed us to reach the scale that we did with multi-tenant simulation environments:

(Figure: scale reached with multi-tenant simulation environments vs. per-run provisioning)

Not only is reaching this scale necessary for building statistical confidence in any approach, it can also surface insights about system efficiencies that aren’t visible in small-scale testing. You need thousands of runs, mirroring the real-world expected usage of enterprise AI systems, to identify and measure those efficiencies. 

When you can run 35,000 isolated evaluations in a day without spinning up 100,000 containers, you can iterate on agents the way we’ve iterated on models: with data, at speed, and with confidence.