Measured. Reproducible. Grounded in your code.
You evaluate your engineers. Why not your agents? RepoGauge turns your commit history into a reproducible benchmark, runs every model against the same tasks, and shows the numbers that matter — pass rate, cost per solved bug, latency, and silent regressions — in one durable artifact trail.
repogauge / runs / 2026-04-17-acme
resolved 84 / 120
Pass rate · leader
71% +4.2
Cost / solve
$2.81 -$0.36
Regression watch
-6.4% drift
Solver leaderboard
120 tasks · gold-verified · seed 42
Cost / quality frontier
higher-left wins
Attempt ledger
live · last 4
auth/session.rs · fix_null_tok
0.9s pass
parser/lexer.py · off_by_one
2.3s pass
router/match.ts · regex_edge
4.1s fail
queue/retry.go · backoff_cap
1.8s run
Benchmarks every major coding agent on the same task set
Claude CLI · Codex CLI · opencode · SWE-agent · Aider · Cursor
Why this exists
Public leaderboards rarely look like your codebase. Providers change silently. Teams end up paying for overlapping assistants because nobody can prove which one actually ships better code. RepoGauge gives that debate a stable measurement loop.
Mine real bugfix commits from your default branch. Each task carries the gold patch and failing tests, so the benchmark stays grounded in work your team already shipped.
dataset.jsonl · grounded in actual fixes
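For a sense of shape, one task record might serialize like the sketch below; every field name here is an assumption, not RepoGauge's actual schema.

# Hypothetical dataset.jsonl task record, sketched as a Python dict.
task = {
    "instance_id": "acme-api__auth-session-null-token",
    "repo": "acme/api",
    "base_commit": "9f3c2e1",  # parent of the bugfix commit
    "gold_patch": "diff --git a/auth/session.rs b/auth/session.rs\n...",
    "fail_to_pass": ["auth::session::tests::rejects_null_token"],
    "pass_to_pass": ["auth::session::tests::accepts_valid_token"],
}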
Everything runs on your machine. Your repositories never leave unless you explicitly point a solver at a remote provider. Hosted workflows execute on your own infrastructure or inside verifiable trusted execution environments (TEEs) — RepoGauge never sees your code.
local-first · your code stays yours
Token spend, cache hits, wall-clock time, and expensive tails sit right beside pass rate. You see exactly where a premium model earns its price and where the cheap tier already wins.
analysis_report.json · decisions you can defend
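Purely as a sketch, a per-solver slice of that report could carry fields like these; the key names are assumptions, not the published format.

# Hypothetical per-solver slice of analysis_report.json.
solver_summary = {
    "solver": "cheap-tier",
    "pass_rate": 0.64,
    "cost_per_solve_usd": 1.10,
    "median_duration_s": 38.0,
    "p95_duration_s": 210.0,  # the expensive tail
    "cache_hit_rate": 0.42,
    "timeout_rate": 0.05,
}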
Before / after
Most teams pick an agent after an afternoon of vibes. A repeatable benchmark changes the conversation from "which one feels smarter" to "which one earns its price on our codebase, this month."
A new model lands. Someone tries it for an afternoon, builds a compelling Flappy Bird clone, and concludes it should work just as well for Python ETLs, right?
Mine real fixes, validate each task before scoring, run the matrix, and leave behind analysis artifacts you can diff over time.
The pipeline
The workflow is explicit: mine, review, export, validate, run, analyze, then optionally train a router. Every stage emits an artifact contract, so the whole system stays inspectable and concrete.
RepoGauge scans the default branch for bugfix-shaped commits and emits a candidate pool that can be reviewed, filtered, and exported into a benchmark built from your own repository. The benchmark stays grounded in real, observed engineering work.
candidates.jsonl is the raw source pool for the whole benchmark.
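As a rough sketch of the mining idea, assuming only that git is on PATH (the real miner is richer than a commit-message grep):

import subprocess

def bugfix_shaped_commits(repo: str) -> list[str]:
    """Hashes of commits whose message looks like a fix."""
    log = subprocess.run(
        ["git", "-C", repo, "log", "--format=%H %s",
         "-i", "--grep=fix", "--grep=bug"],  # multiple --grep flags are OR'd
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.split()[0] for line in log.splitlines() if line]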
The review stage applies accept and reject heuristics, generates a browsable summary, and draws a clean boundary between strong bugfixes and changes that do not belong in the benchmark.
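A toy accept rule might read like the function below; every field and threshold here is invented for illustration.

def accept(candidate: dict) -> bool:
    """Toy accept heuristic: small, focused, test-backed fixes only."""
    changed = candidate["insertions"] + candidate["deletions"]
    return (
        candidate["touches_tests"]  # the fix shipped with a test change
        and changed <= 200          # small enough to be a focused bugfix
        and not candidate["is_merge"]
    )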
Export turns accepted candidates into dataset instances, writes the gold predictions, and generates a repo-specific evaluation adapter so the scoring flow understands repositories it has never seen.
Gold validation runs the scoring flow against the human patches inside the container image, so broken tasks get caught before they contaminate solver comparisons.
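Conceptually the check reduces to: apply the human patch on a clean checkout and require the once-failing tests to pass. A sketch, reusing the hypothetical task record from above, with pytest standing in for whatever runner the repo's adapter selects:

import subprocess

def gold_is_solvable(workdir: str, task: dict) -> bool:
    subprocess.run(["git", "-C", workdir, "checkout", task["base_commit"]],
                   check=True)
    subprocess.run(["git", "-C", workdir, "apply", "-"],
                   input=task["gold_patch"], text=True, check=True)
    # pytest here is a stand-in; the real flow runs inside the container image
    tests = subprocess.run(["pytest", *task["fail_to_pass"]], cwd=workdir)
    return tests.returncode == 0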
RepoGauge executes the matrix configuration against the resolved dataset and records one row per attempt: patch, tokens, cost, duration, exit reason, and workspace artifacts.
Analysis turns raw attempts into decisions: pass rate, cost per solved task, expensive tails, timeout rate, and the spread between uniform routing and mixed strategies.
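The core arithmetic is simple enough to sketch; the file path and field names below are assumed.

import json

with open("out/run/attempts.jsonl") as f:  # path assumed
    attempts = [json.loads(line) for line in f]
solved = [a for a in attempts if a["resolved"]]
pass_rate = len(solved) / len(attempts)
# Cost per solved task charges every failed attempt to the successes,
# which is what makes a cheap-but-flaky model look expensive.
cost_per_solve = sum(a["cost_usd"] for a in attempts) / max(len(solved), 1)
print(f"pass rate {pass_rate:.0%}, cost per solve ${cost_per_solve:.2f}")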
Router training fits a small decision tree on analysis data so future task routing can trade off success probability against cost in a concrete, repo-specific way.
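A hedged sketch of what that training step could look like with scikit-learn; the feature set and labels are invented stand-ins for RepoGauge's analysis data.

from sklearn.tree import DecisionTreeClassifier

rows = [  # toy analysis rows; real ones come from the analyze stage
    {"diff_lines": 8,   "failing_tests": 1, "files": 1, "cheap_solved": True},
    {"diff_lines": 140, "failing_tests": 4, "files": 6, "cheap_solved": False},
]
X = [[r["diff_lines"], r["failing_tests"], r["files"]] for r in rows]
y = [r["cheap_solved"] for r in rows]
router = DecisionTreeClassifier(max_depth=3).fit(X, y)
# Route to the cheap tier only when its predicted success odds are high.
p_cheap = router.predict_proba([[12, 1, 2]])[0][1]  # classes_ == [False, True]
use_cheap_model = p_cheap > 0.8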
Answers, with evidence
These are the questions teams re-litigate in Slack, planning docs, and budget reviews. RepoGauge gives you a stable artifact trail that survives the argument.
01 / QUALITY
On your repo, against your historical bugfixes and test suites.
02 / ECONOMICS
Pass rate alone hides the economic story. RepoGauge shows the real tradeoff at the margin.
03 / DRIFT
Rerun the same matrix on the same dataset and diff the outputs against a stable baseline.
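One way to make that diff concrete, assuming attempt rows carry an instance_id and a resolved flag:

import json

def resolved_ids(path: str) -> set[str]:
    rows = [json.loads(line) for line in open(path)]
    return {r["instance_id"] for r in rows if r["resolved"]}

baseline = resolved_ids("out/run/2026-03-01/attempts.jsonl")  # paths assumed
latest = resolved_ids("out/run/2026-04-17/attempts.jsonl")
print("silent regressions:", sorted(baseline - latest))
print("newly solved:", sorted(latest - baseline))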
Who it is for
Strongest fit for any group making high-leverage decisions about coding agents, spend, or platform defaults — and wanting an answer they can defend after the meeting ends.
Engineering leaders need to answer a practical question: which assistant should the team use, and is it earning its seat? RepoGauge gives a repo-specific answer with enough depth to survive scrutiny.
Platform teams are typically mediating between user demand, cost controls, and infrastructure risk. RepoGauge helps them decide which agents to expose, which images to run, and what the guardrails should be.
If your job is to keep an eye on provider quality, a reproducible benchmark on your own repositories is a sharper instrument than scattered developer complaints.
Quickstart
The workflow is linear on purpose. Stop after any stage, inspect artifacts, or keep going to a full solver comparison and router-training dataset.
1
Use the project's uv workflow so dependencies stay reproducible.
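Assuming a local clone of the repository, a single sync installs everything pinned in the lockfile:
$ uv sync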
2
Mine your repo, review candidates, and export a benchmark dataset.
$ uv run repogauge mine /path/to/repo
$ uv run repogauge review candidates.jsonl
$ uv run repogauge export reviewed.jsonl
3
Confirm gold solvability, run the matrix, and produce a report that survives review.
$ uv run repogauge eval dataset.jsonl --gold
$ uv run repogauge run examples/matrix.yaml
$ uv run repogauge analyze ./out/run/<run_id>
Hosted option
If a managed version would save your team time across many repositories, leave your details and what you would need. That shapes the hosted roadmap around real usage.
Share how you want to run evaluations across your repositories, where spend needs tighter control, and what would make the results easier to act on across the organization.
Hosted interest
If you already know you would use a hosted version, open the form. If you want a quick conversation first, use the calendar link.
Articles
Practical pieces to share with engineering leaders, platform teams, and budget owners who need evidence before changing their workflow.
The outcome
Durable benchmark for your own repo. Honest comparisons. The cost and quality evidence you need to choose coding agents — and defend the choice six months later.
Lower model spend · Confidence in cheaper models · Cleaner provider decisions · Regression visibility · Shareable benchmark reports