Your Coding Agent Is a Slot Machine

Update (Dec 19): I originally stated this used Sonnet 4.5. I discovered my script wasn’t picking up the --model flag, so Claude Code used automatic model selection—which typically means Opus. This makes the Vibe comparison more significant: an open-weight model matching Anthropic’s flagship, not just Sonnet.

TL;DR

Same agent. Same bug. Same prompt. Different outcome—and even when it “wins,” it wins differently every time.

  • I ran Claude Code and Vibe (Devstral 2) on SWE-bench-verified-mini (45 cases).
  • Each case got 10 attempts per agent → 900 total runs.
  • ~40% of cases were “mixed”: same agent, same issue, different outcomes across the 10 runs.
  • Even when an agent always solves a case, it doesn’t solve it the same way: I saw ~8× patch-size swings on a 10/10 case.
  • The overall pass rates are close: Claude 39.8% vs Vibe 37.6%. On this setup, that difference is within statistical error.
  • They don’t fail on the same things: 5 cases only Claude ever solves, 4 cases only Vibe ever solves.
  • Vibe was a bit faster in this run.

Why I Did This

I wanted to measure something I couldn’t find in other benchmark posts: how stable the result is.

When we see a benchmark—“X% on SWE-bench Verified”—we mentally translate that into “it solves X% of problems.” The assumption is binary: either it can handle a problem or it can’t, not that you got lucky this run. That shapes how I use the tool: if it works, I take the solution and move on; if it fails, I adjust the prompt or try another model. I almost never think: let me just rerun the same thing.

I knew agents aren’t deterministic. But I wanted to find a measure: how often does the same task give a different outcome? Is the variance big enough to matter?

Anthropic’s Claude Sonnet 4.5 post mentions they run multiple iterations per case on SWE-bench Verified, but they don’t publish the spread. Mistral’s Vibe announcement doesn’t really spell out methodology. I looked for this data and couldn’t find it, so I ran my own experiment.

I used Claude as the baseline because it’s what I actually use day-to-day. I picked Vibe (Devstral 2) because it’s recent, open, showed promising benchmark scores, and is free and easy to run.

Worth noting: SWE-bench-verified-mini appears harder than SWE-bench Verified (the full 500). Provider-published scores on Verified are roughly 2× what I’m seeing on this mini subset. So I’m treating the absolute percentages as less important than the consistency patterns.

The Setup

Benchmark: SWE-bench-verified-mini — 45 real GitHub issues from Django and Sphinx.

Each case is a real bug: the agent gets the issue description and the codebase, then produces a fix. Hidden unit tests verify whether the fix actually works—the agent never sees them.

The dataset has 50 entries. I dropped 5 that “passed” even with an empty patch (nothing to fix). That left 45 cases where a patch actually matters.

Method:

  • 10 runs per agent per case
  • 450 runs per agent (900 total)
  • Same prompts, same evaluation criteria
  • 10 runs felt like the right trade-off—enough signal without spending another week on this. (The loop itself is sketched below.)
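Concretely, the harness is just nested loops. A simplified sketch, not the actual scripts: `run_agent`, `evaluate_patch`, and the file names are placeholders standing in for the agent invocation and the SWE-bench evaluation step.

```python
import json
from pathlib import Path

N_RUNS = 10
AGENTS = ["claude", "vibe"]

def run_agent(agent: str, case: dict) -> str:
    """Invoke the agent CLI on the case's repo + issue text and return the diff it produced.
    Placeholder for the real Claude Code / Vibe invocation."""
    raise NotImplementedError

def evaluate_patch(case: dict, patch: str) -> bool:
    """Apply the patch and run the hidden tests via the SWE-bench harness. Placeholder."""
    raise NotImplementedError

cases = [json.loads(line) for line in Path("swe_bench_verified_mini.jsonl").read_text().splitlines()]

records = []  # one record per (agent, case, attempt)
for agent in AGENTS:
    for case in cases:
        for attempt in range(N_RUNS):
            patch = run_agent(agent, case)
            records.append({
                "agent": agent,
                "case": case["instance_id"],
                "attempt": attempt,
                "passed": evaluate_patch(case, patch),
                "patch_bytes": len(patch.encode()),
            })

Path("results.jsonl").write_text("\n".join(json.dumps(r) for r in records))
```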

Agents:

  • Claude Code with automatic model selection (I’ll call this “Claude”)
  • Devstral 2 via Vibe

Important: this is agentic benchmarking (Claude Code / Vibe), not raw model API calls. Anthropic also notes their benchmark setup is “simplified,” not necessarily Claude Code itself. In practice, scaffolding matters: system prompt, tools, guardrails, all of it.

The Results

The Variance Problem

About 40% of test cases were inconsistent across runs. Same agent, same issue, 10 attempts → a mix of passes and fails.

Test case consistency across 10 runs (45 cases):

          Always pass   Mixed   Always fail
Claude    11            18      16
Vibe      9             19      17

Here’s the part I think people regularly misread:

When you see something like “77% on SWE-bench Verified,” it’s tempting to assume that means “it solves 77% of problems reliably.”

But there are (at least) two very different worlds that can both produce “77%”:

  • Model A: 77% of issues are basically “solved class” (10/10), and the rest are “nope” (0/10).
  • Model B: each issue is a roll of the dice. On any given attempt it has ~77% chance to land a solve.

Both average to 77%. In real use, they don’t feel remotely the same. And as I mentioned—I usually operate as if it’s Model A.
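A toy simulation (synthetic numbers, not my data) makes the gap between those two worlds concrete:

```python
import random

random.seed(0)
N_CASES, N_RUNS = 1000, 10

# Model A: 77% of cases are always solved, the rest never are.
model_a = [[i < 770] * N_RUNS for i in range(N_CASES)]

# Model B: every attempt on every case independently succeeds with p = 0.77.
model_b = [[random.random() < 0.77 for _ in range(N_RUNS)] for _ in range(N_CASES)]

for name, runs in (("Model A", model_a), ("Model B", model_b)):
    pass_rate = sum(map(sum, runs)) / (N_CASES * N_RUNS)
    mixed = sum(0 < sum(r) < N_RUNS for r in runs) / N_CASES  # neither 0/10 nor 10/10
    print(f"{name}: reported pass rate {pass_rate:.0%}, mixed cases {mixed:.0%}")

# Model A: reported pass rate 77%, mixed cases 0%
# Model B: reported pass rate ~77%, mixed cases ~93%
```

Identical headline number, completely different experience per attempt.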

My results look a lot closer to “there’s a big middle bucket.” A bunch of cases live in “sometimes it gets there, sometimes it doesn’t.”

To make that clearer, I broke it down into three metrics:

  • Ceiling: “Did it solve this at least once?” (best case if you’re willing to rerun)
  • Reported (per-run pass rate): the usual benchmark number
  • Floor: “Did it solve it every time?” (reliability)

Metric                          Claude   Vibe
Ceiling (solved > 0/10)         64.4%    62.2%
Reported (per-run pass rate)    39.8%    37.6%
Floor (solved 10/10)            24.4%    20.0%
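All three fall straight out of the per-case pass lists. A sketch, assuming the results.jsonl format from the harness sketch earlier:

```python
import json
from collections import defaultdict
from pathlib import Path

# passes[agent][case] -> list of booleans, one per attempt
passes = defaultdict(lambda: defaultdict(list))
for line in Path("results.jsonl").read_text().splitlines():
    r = json.loads(line)
    passes[r["agent"]][r["case"]].append(r["passed"])

for agent, cases in passes.items():
    ceiling = sum(any(v) for v in cases.values()) / len(cases)   # solved at least once
    reported = sum(map(sum, cases.values())) / sum(len(v) for v in cases.values())
    floor = sum(all(v) for v in cases.values()) / len(cases)     # solved every single time
    print(f"{agent}: ceiling {ceiling:.1%}, reported {reported:.1%}, floor {floor:.1%}")
```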

For Claude here:

  • Ceiling: 64.4%
  • Floor: 24.4%

That’s a 40-point gap. I wasn’t expecting it to be that wide. The headline benchmark score sits in the middle and hides both extremes.

Patch Size Variability

Even when an agent is “solid” on a case, the shape of the solution can swing a lot.

Example: django__django-12262 — Claude passes 10/10, but patch sizes range from 716 to 5,703 bytes. That’s roughly an 8× spread.

I expected the 10/10 cases to be boring—same solution every time. They weren’t.

This isn’t just one weird case. Across problems:

  • Claude: median coefficient of variation = 38%
  • Vibe: median coefficient of variation = 46%

Even when it works, you’re getting a different solution each time.
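For reference, coefficient of variation here is the standard deviation of patch size divided by its mean, computed per case and then summarized by the median across cases. A sketch of one way to compute it from the same results.jsonl, restricting to passing runs:

```python
import json
import statistics
from collections import defaultdict
from pathlib import Path

# sizes[agent][case] -> patch sizes (bytes) for runs that passed
sizes = defaultdict(lambda: defaultdict(list))
for line in Path("results.jsonl").read_text().splitlines():
    r = json.loads(line)
    if r["passed"]:
        sizes[r["agent"]][r["case"]].append(r["patch_bytes"])

for agent, cases in sizes.items():
    cvs = [statistics.stdev(v) / statistics.mean(v) for v in cases.values() if len(v) >= 2]
    print(f"{agent}: median CV of patch size = {statistics.median(cvs):.0%}")
```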

Overall Performance (closer than I expected)

Model               Pass Rate   Passed Runs   95% CI
Claude Code         39.8%       179/450       37.3% – 42.2%
Devstral 2 (Vibe)   37.6%       169/450       35.1% – 40.0%

On this setup, the 2.2-point gap sits inside the overlapping confidence intervals. I wouldn’t claim one “wins” overall from this alone.

Time Comparison

Model    Mean   Median   Std Dev
Claude   357s   337s     177s
Vibe     296s   268s     150s

Vibe was faster here—nothing dramatic, but noticeable.

Where They Differ

Category                     Count
Both always pass             9
Both always fail             12
Only Claude ever succeeds    5
Only Vibe ever succeeds      4
Both have partial success    15
Claude beats Vibe            12
Vibe beats Claude            8

The “only X ever succeeds” rows stand out. They’re not just trading percentage points—they have strengths on different problems.

Only Vibe succeeds: sphinx-8551 (60% vs 0%), sphinx-9698 (30% vs 0%), sphinx-7748 (20% vs 0%), sphinx-8056 (10% vs 0%)

Only Claude succeeds: django-12325 (70% vs 0%), sphinx-7757 (40% vs 0%), sphinx-9461 (10% vs 0%), sphinx-9320 (10% vs 0%), sphinx-7985 (10% vs 0%)
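Those two lists are just a set difference over “solved at least once.” A quick way to pull them out of results.jsonl, with the agent keys from the sketch above:

```python
import json
from collections import defaultdict
from pathlib import Path

solved = defaultdict(set)  # agent -> cases solved at least once
for line in Path("results.jsonl").read_text().splitlines():
    r = json.loads(line)
    if r["passed"]:
        solved[r["agent"]].add(r["case"])

print("Only Claude ever solves:", sorted(solved["claude"] - solved["vibe"]))
print("Only Vibe ever solves:  ", sorted(solved["vibe"] - solved["claude"]))
```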

“Single-run benchmarks can lie” examples

A few Claude cases were basically lottery tickets:

Test Case                 Pass Rate
sphinx-doc__sphinx-7985   10% (1/10)
sphinx-doc__sphinx-8269   10% (1/10)
sphinx-doc__sphinx-8475   10% (1/10)
sphinx-doc__sphinx-9281   10% (1/10)

If you benchmark those once, you might report 0% or 100% depending on luck. With 10 tries you at least see what bucket they’re in.
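The math backs that up: if a case’s true per-attempt solve rate is 10% and attempts are independent, a single run reports it as a flat failure nine times out of ten, and even 10 runs all miss it about a third of the time.

```python
# Probability that k independent attempts all fail, given a per-attempt solve rate p.
def all_fail(p: float, k: int) -> float:
    return (1 - p) ** k

for k in (1, 3, 10):
    print(f"p = 10%, {k:>2} attempt(s): all fail with probability {all_fail(0.10, k):.0%}")
# 1 attempt: 90%, 3 attempts: 73%, 10 attempts: 35%
```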

What the Numbers Don’t Capture: UX

The result I didn’t expect: Claude and Vibe are basically tied. Devstral 2 (open) landing within a couple of points of Anthropic’s flagship (closed) on real bug-fix tasks is not what I predicted.

But: I use Claude Code daily. I’ve spent time with Vibe and Codex too. And the interaction quality still feels different.

Claude tends to “just get it” when my prompt is messy. It fills in gaps, makes decent assumptions, and I don’t have to babysit as much.

Vibe (and Codex, for me) more often needs a second pass: I rephrase, I restate constraints, I correct the direction. On a benchmark that’s perfectly specified upfront, that friction doesn’t show up. In real work, it absolutely does.

That’s why I’m still on Claude day-to-day. But an open model matching it on benchmarks? That wasn’t supposed to happen yet (or ever?).

The Bigger Issue

I think consistency is a bigger deal than the current discussion gives it credit for. It comes down to trust.

However good an agent is on benchmarks—if it’s not consistent, I spend time checking, verifying, babysitting. Every. Single. Time. That overhead doesn’t show up in benchmark scores, but it dominates my actual workflow.

Labs mostly focus on benchmark performance. The implicit assumption seems to be: if we hit 100%, the rest is minor. But a 100% pass rate with 8× patch-size variability still means I can’t trust the output without review. I’m still in verification mode.

Improving consistency might matter more than improving performance. A tool I can trust 100% of the time on 60% of tasks beats a tool that’s 80% but unpredictable. The first one I can build a workflow around. The second one I’m always second-guessing.

This also applies to the “I gave the same prompt to two models” style comparisons you see everywhere. Given this variance, you could compare a model to itself and conclude one run is “better” than another. A single head-to-head tells you almost nothing—although it’s often entertaining to watch.

Caveats

  • Claude Code used automatic model selection, which typically defaults to Opus. I originally thought I was running Sonnet 4.5, but discovered my --model flag wasn’t being picked up.
  • 45 test cases from two repos (Django and Sphinx), Python only. That’s too small to draw broad conclusions about model capabilities; treat these findings as directional, not definitive.
  • This is an agentic setup comparison (Claude Code vs Vibe), not a clean “model vs model” API test.

What’s Next

Next up I’m testing truly local models on consumer hardware. Early results: the gap is bigger there. The real question is how much bigger—and what you get back in exchange (cost, privacy, speed, control).

I also want to test quantized Devstral 2 and measure how much quantization moves both performance and variance.

About: I'm building AI tools at kvit.app. This benchmark is part of a broader project: figuring out what actually makes coding assistants useful beyond a single score.