Unsupervised benchmark for evaluating CLI coding agents on bug detection.
TL;DR: Agents playing treasure hunt! Challenger agents hide bugs and record them in a ground-truth bugs.json manifest, reviewer agents try to find those bugs, and an LLM matcher scores the assignments. No human labeling.
- Challenger agent injects bugs into a repo and writes a `bugs.json` ground-truth manifest (prompt); bug count scales with repo size and is capped (currently 24 per challenge). If the manifest JSON is malformed, an LLM reformatting step repairs it into our schema (system, user)
- A different agent reviews the code blind and emits `bugs/*.json` findings (prompt)
- LLM judge scores bug-to-finding assignments (prompt)
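As a minimal sketch of the manifest-handling step, the loader below parses a challenger-written `bugs.json` and flags malformed JSON for the LLM repair pass. The field names (`file`, `line`, `description`) and function names are assumptions for illustration, not the actual cheddar schema.

```python
import json

# Hypothetical manifest fields -- the real cheddar schema may differ.
REQUIRED_FIELDS = {"file", "line", "description"}

def load_manifest(raw: str) -> list[dict]:
    """Parse a challenger-written bugs.json; signal when LLM repair is needed."""
    try:
        bugs = json.loads(raw)
    except json.JSONDecodeError:
        # In the real pipeline, an LLM reformatting step repairs malformed JSON.
        raise ValueError("malformed manifest: needs LLM repair pass")
    # Keep only entries carrying the fields the matcher needs.
    return [b for b in bugs if REQUIRED_FIELDS <= b.keys()]

manifest = load_manifest(
    '[{"file": "src/slug.py", "line": 42, "description": "off-by-one in truncation"}]'
)
```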
Note: This benchmark tests CLI tools (Claude Code, Codex CLI, Gemini CLI), not the underlying models directly. Each CLI has its own system prompt, tool implementations, and default settings that affect behavior.
Results
150 challenges (3 challengers x 50 repos), 450 reviews, 2,603 injected bugs.
Scoring policy for reported results:
- matcher consumes raw `bugs/*.json` finding payloads from reviewer runs
- matcher runs with `--repeat 5 --aggregate median`
Weighted bugs found (%):
🟩 Claude: █████████████████████████████ (58.05%)
🟧 Codex: ███████████████████ (37.84%)
🟦 Gemini: ██████████████ (27.81%)
| Reviewer | Weighted Bugs Found | Unweighted Detection Rate |
|---|---|---|
| Claude | 1,511 / 2,603 (58.05%) | 61.65% |
| Codex | 985 / 2,603 (37.84%) | 43.17% |
| Gemini | 724 / 2,603 (27.81%) | 34.64% |
Why two metrics:
- Unweighted = mean of per-challenge detection rates (each challenge counts equally).
- Weighted = global bug-level recall (`sum(bugs_found) / sum(total_bugs)`).
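The two aggregation rules can be sketched directly from per-challenge counts (the numbers below are illustrative, not benchmark data):

```python
# Per-challenge (bugs_found, total_bugs); illustrative numbers only.
challenges = [(10, 20), (3, 4), (5, 24)]

# Unweighted: mean of per-challenge detection rates (each challenge counts equally).
unweighted = sum(found / total for found, total in challenges) / len(challenges)

# Weighted: global bug-level recall over the pooled bug set.
weighted = sum(found for found, _ in challenges) / sum(total for _, total in challenges)
```

With these numbers the weighted recall is 18/48 = 37.5%, while the unweighted mean is pulled up by the small 3/4 challenge, which is exactly why both are reported.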
For a concise write-up, see REPORT.md.
Usage
```
uv sync
uv run cheddar challenge claude slugify                      # inject bugs
uv run cheddar review codex slugify -c <id>                  # review blind
uv run cheddar match -c <id>                                 # score (single run)
uv run cheddar match -c <id> --repeat 3 --aggregate median   # score (median-of-3)
```
Scoring
An LLM judge (gpt-5.2) performs one global assignment per review: all injected bugs vs all reviewer `bugs/*.json` payloads. The judge is instructed to match only when file, location, and mechanism align, and to leave uncertain bugs unmatched. Each match includes a supporting quote and line number from the reviewer's findings.
To reduce judge stochasticity, scoring uses repeats with median aggregation (--repeat 5 --aggregate median).
Scoring uses raw reviewer finding payloads as produced by agents.
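The repeat-and-aggregate step reduces to taking a median over per-repeat match counts. A minimal sketch, using the stdlib `statistics.median` (the surrounding names are assumptions, not cheddar's internals):

```python
from statistics import median

def aggregate_matches(repeat_counts: list[int]) -> float:
    """Median of matched-bug counts across judge repeats (--aggregate median)."""
    return median(repeat_counts)

# Five repeats of the same review scored by a stochastic judge.
score = aggregate_matches([14, 15, 13, 15, 14])
```

An odd repeat count keeps the median at an actually-observed value, which is one reason 3 or 5 repeats are natural choices.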
Agents
We tested official CLI tools with autonomous permission flags enabled on sandboxed challenge repositories.
| Agent | Model | CLI Flags |
|---|---|---|
| Claude Code | claude-opus-4-6 | --dangerously-skip-permissions --output-format stream-json --verbose |
| Codex CLI | gpt-5.3-codex | exec --dangerously-bypass-approvals-and-sandbox --skip-git-repo-check --json -c model_reasoning_effort='medium' |
| Gemini CLI | gemini-3-pro-preview | --yolo |
Model versions are configured in `src/cheddar/agents/config.py`; CLI command flags are defined in the per-agent wrappers under `src/cheddar/agents/*.py`.
PRs welcome for additional agents.
Dataset
Benchmark corpus: 50 open-source utility libraries across JS/TS/Python/Go/C/Ruby/Rust/Java/C#.
Source repos live under `repos/` (`uv run cheddar list repos`).
Reference snapshot:
- challenges archive (`.tar.gz`) — `sha256:5fb101ccef70642875beab4fa40245fd83e1405bc1e6056eac196d52b73ad237`
- manifest (`manifest.json`) — `sha256:185700655efd7835a1f5ee0e332313c5f17865192b6860ceb7c56fede7a34f97`
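To check a downloaded snapshot against the published digests, a standard streaming `hashlib` pass is enough; the file name in the comment is an assumption for illustration.

```python
import hashlib

def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# e.g. sha256_file("challenges.tar.gz") should equal the published archive digest.
```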