Measured. Reproducible. Grounded in your code.
You evaluate your engineers. Why not your agents? RepoGauge turns your commit history into a reproducible benchmark, runs every model against the same tasks, and shows the numbers that matter — pass rate, cost per solved bug, latency, and silent regressions — in one durable artifact trail.
repogauge / runs / 2026-04-17-acme
resolved 84 / 120
Pass rate · leader
71% +4.2
Cost / solve
$2.81 -$0.36
Regression watch
-6.4% drift
Solver leaderboard
120 tasks · gold-verified · seed 42
Cost / quality frontier
higher-left wins
Attempt ledger
live · last 4
auth/session.rs · fix_null_tok
0.9s pass
parser/lexer.py · off_by_one
2.3s pass
router/match.ts · regex_edge
4.1s fail
queue/retry.go · backoff_cap
1.8s run
Benchmarks every major coding agent on the same task set
Claude CLI · Codex CLI · opencode · SWE-agent · Aider · Cursor
Why this exists
Public leaderboards rarely look like your codebase. Providers change silently. Teams end up paying for overlapping assistants because nobody can prove which one actually ships better code. RepoGauge gives that debate a stable measurement loop.
Mine real bugfix commits from your default branch. Each task carries the gold patch and failing tests, so the benchmark stays grounded in work your team already shipped.
dataset.jsonl · grounded in actual fixes
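For a sense of shape, one task record might serialize like the sketch below; every field name here is an assumption, not RepoGauge's actual schema.

# Hypothetical dataset.jsonl task record, sketched as a Python dict.
task = {
    "instance_id": "acme-api__auth-session-null-token",
    "repo": "acme/api",
    "base_commit": "9f3c2e1",  # parent of the bugfix commit
    "gold_patch": "diff --git a/auth/session.rs b/auth/session.rs\n...",
    "fail_to_pass": ["auth::session::tests::rejects_null_token"],
    "pass_to_pass": ["auth::session::tests::accepts_valid_token"],
}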
Everything runs on your machine. Your repositories never leave unless you explicitly point a solver at a remote provider. Hosted workflows execute on your own infrastructure or inside verifiable trusted execution environments (TEEs) — RepoGauge never sees your code.
local-first · your code stays yours
Token spend, cache hits, wall-clock time, and expensive tails sit right beside pass rate. You see exactly where a premium model earns its price and where the cheap tier already wins.
analysis_report.json · decisions you can defend
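Purely as a sketch, a per-solver slice of that report could carry fields like these; the key names are assumptions, not the published format.

# Hypothetical per-solver slice of analysis_report.json.
solver_summary = {
    "solver": "cheap-tier",
    "pass_rate": 0.64,
    "cost_per_solve_usd": 1.10,
    "median_duration_s": 38.0,
    "p95_duration_s": 210.0,  # the expensive tail
    "cache_hit_rate": 0.42,
    "timeout_rate": 0.05,
}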
Before / after
Most teams pick an agent after an afternoon of vibes. A repeatable benchmark changes the conversation from "which one feels smarter" to "which one earns its price on our codebase, this month."
A new model lands. Someone tries it for an afternoon, builds a compelling Flappy Bird clone, and concludes it should work just as well for Python ETLs, right?
Mine real fixes, validate each task before scoring, run the matrix, and leave behind analysis artifacts you can diff over time.
The pipeline
The workflow is explicit: mine, review, export, validate, run, analyze, then optionally train a router. Every stage emits an artifact contract, so the whole system stays inspectable and concrete.
RepoGauge scans the default branch for bugfix-shaped commits and emits a candidate pool that can be reviewed, filtered, and exported into a benchmark built from your own repository. The benchmark stays grounded in real, observed engineering work.
candidates.jsonl is the raw source pool for the whole benchmark.
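As a rough sketch of the mining idea, assuming only that git is on PATH (the real miner is richer than a commit-message grep):

import subprocess

def bugfix_shaped_commits(repo: str) -> list[str]:
    """Hashes of commits whose message looks like a fix."""
    log = subprocess.run(
        ["git", "-C", repo, "log", "--format=%H %s",
         "-i", "--grep=fix", "--grep=bug"],  # multiple --grep flags are OR'd
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.split()[0] for line in log.splitlines() if line]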
The review stage applies accept and reject heuristics, generates a browsable summary, and draws a clean boundary between strong bugfixes and changes that do not belong in the benchmark.
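A toy accept rule might read like the function below; every field and threshold here is invented for illustration.

def accept(candidate: dict) -> bool:
    """Toy accept heuristic: small, focused, test-backed fixes only."""
    changed = candidate["insertions"] + candidate["deletions"]
    return (
        candidate["touches_tests"]  # the fix shipped with a test change
        and changed <= 200          # small enough to be a focused bugfix
        and not candidate["is_merge"]
    )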
Export turns accepted candidates into dataset instances, writes the gold predictions, and generates a repo-specific evaluation adapter so the scoring flow understands repositories it has never seen.
Gold validation runs the scoring flow against the human patches inside the container image, so broken tasks get caught before they contaminate solver comparisons.
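Conceptually the check reduces to: apply the human patch on a clean checkout and require the once-failing tests to pass. A sketch, reusing the hypothetical task record from above, with pytest standing in for whatever runner the repo's adapter selects:

import subprocess

def gold_is_solvable(workdir: str, task: dict) -> bool:
    subprocess.run(["git", "-C", workdir, "checkout", task["base_commit"]],
                   check=True)
    subprocess.run(["git", "-C", workdir, "apply", "-"],
                   input=task["gold_patch"], text=True, check=True)
    # pytest here is a stand-in; the real flow runs inside the container image
    tests = subprocess.run(["pytest", *task["fail_to_pass"]], cwd=workdir)
    return tests.returncode == 0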
RepoGauge executes the matrix configuration against the resolved dataset and records one row per attempt: patch, tokens, cost, duration, exit reason, and workspace artifacts.
Analysis turns raw attempts into decisions: pass rate, cost per solved task, expensive tails, timeout rate, and the spread between uniform routing and mixed strategies.
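The core arithmetic is simple enough to sketch; the file path and field names below are assumed.

import json

with open("out/run/attempts.jsonl") as f:  # path assumed
    attempts = [json.loads(line) for line in f]
solved = [a for a in attempts if a["resolved"]]
pass_rate = len(solved) / len(attempts)
# Cost per solved task charges every failed attempt to the successes,
# which is what makes a cheap-but-flaky model look expensive.
cost_per_solve = sum(a["cost_usd"] for a in attempts) / max(len(solved), 1)
print(f"pass rate {pass_rate:.0%}, cost per solve ${cost_per_solve:.2f}")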
Router training fits a small decision tree on analysis data so future task routing can trade off success probability against cost in a concrete, repo-specific way.
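A hedged sketch of what that training step could look like with scikit-learn; the feature set and labels are invented stand-ins for RepoGauge's analysis data.

from sklearn.tree import DecisionTreeClassifier

rows = [  # toy analysis rows; real ones come from the analyze stage
    {"diff_lines": 8,   "failing_tests": 1, "files": 1, "cheap_solved": True},
    {"diff_lines": 140, "failing_tests": 4, "files": 6, "cheap_solved": False},
]
X = [[r["diff_lines"], r["failing_tests"], r["files"]] for r in rows]
y = [r["cheap_solved"] for r in rows]
router = DecisionTreeClassifier(max_depth=3).fit(X, y)
# Route to the cheap tier only when its predicted success odds are high.
p_cheap = router.predict_proba([[12, 1, 2]])[0][1]  # classes_ == [False, True]
use_cheap_model = p_cheap > 0.8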
Answers, with evidence
These are the questions teams re-litigate in Slack, planning docs, and budget reviews. RepoGauge gives you a stable artifact trail that survives the argument.
01 / QUALITY
On your repo, against your historical bugfixes and test suites.
02 / ECONOMICS
Pass rate alone hides the economic story. RepoGauge shows the real tradeoff at the margin.
03 / DRIFT
Rerun the same matrix on the same dataset and diff the outputs against a stable baseline.
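One way to make that diff concrete, assuming attempt rows carry an instance_id and a resolved flag:

import json

def resolved_ids(path: str) -> set[str]:
    rows = [json.loads(line) for line in open(path)]
    return {r["instance_id"] for r in rows if r["resolved"]}

baseline = resolved_ids("out/run/2026-03-01/attempts.jsonl")  # paths assumed
latest = resolved_ids("out/run/2026-04-17/attempts.jsonl")
print("silent regressions:", sorted(baseline - latest))
print("newly solved:", sorted(latest - baseline))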
Who it is for
Strongest fit for any group making high-leverage decisions about coding agents, spend, or platform defaults — and wanting an answer they can defend after the meeting ends.
Engineering leaders need to answer a practical question: which assistant should the team use, and is it earning its seat? RepoGauge gives a repo-specific answer with enough depth to survive scrutiny.
Platform teams are typically mediating between user demand, cost controls, and infrastructure risk. RepoGauge helps them decide which agents to expose, which images to run, and what the guardrails should be.
If your job is to keep an eye on provider quality, a reproducible benchmark on your own repositories is a sharper instrument than scattered developer complaints.
Quickstart
The workflow is linear on purpose. Stop after any stage, inspect artifacts, or keep going to a full solver comparison and router-training dataset.
1
Use the project's uv workflow so dependencies stay reproducible.
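Assuming a local clone of the repository, a single sync installs everything pinned in the lockfile:
$ uv sync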
2
Mine your repo, review candidates, and export a benchmark dataset.
$ uv run repogauge mine /path/to/repo
$ uv run repogauge review candidates.jsonl
$ uv run repogauge export reviewed.jsonl
3
Confirm gold solvability, run the matrix, and produce a report that survives review.
$ uv run repogauge eval dataset.jsonl --gold
$ uv run repogauge run examples/matrix.yaml
$ uv run repogauge analyze ./out/run/<run_id>
Hosted option
If a managed version would save your team time across many repositories, leave your details and what you would need. That shapes the hosted roadmap around real usage.
Share how you want to run evaluations across your repositories, where spend needs tighter control, and what would make the results easier to act on across the organization.
Hosted interest
If you already know you would use a hosted version, open the form. If you want a quick conversation first, use the calendar link.
Articles
Practical pieces to share with engineering leaders, platform teams, and budget owners who need evidence before changing their workflow.
The outcome
Durable benchmark for your own repo. Honest comparisons. The cost and quality evidence you need to choose coding agents — and defend the choice six months later.
Lower model spend · Confidence in cheaper models · Cleaner provider decisions · Regression visibility · Shareable benchmark reports