cheddar-bench


Unsupervised benchmark for evaluating CLI coding agents on bug detection.

TL;DR: Agents play treasure hunt: challenger agents hide bugs (and write a ground-truth bugs.json manifest), reviewer agents try to find those bugs, and an LLM matcher scores the bug-to-finding assignments. No human labeling.

  1. Challenger agent injects bugs into a repo and writes a bugs.json ground-truth manifest (prompt); bug count scales with repo size and is capped (currently 24 per challenge). If the manifest JSON is malformed, an LLM reformatting step repairs it into our schema (system, user)
  2. A different agent reviews the code blind and emits bugs/*.json findings (prompt)
  3. An LLM judge scores bug-to-finding assignments (prompt)
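The excerpt does not publish the manifest schema, but the repair step in step 1 implies a required-fields check of some kind. A minimal sketch of what a bugs.json entry might look like, with a trivial well-formedness check; all field names here are hypothetical, not the benchmark's actual schema:

```python
import json

# Hypothetical bugs.json manifest entry; field names are illustrative only.
manifest = {
    "bugs": [
        {
            "id": "bug-001",
            "file": "src/slugify.py",
            "line": 42,
            "description": "Off-by-one in truncation: slices to max_len - 1 "
                           "instead of max_len",
            "severity": "minor",
        }
    ]
}

REQUIRED_FIELDS = {"id", "file", "line", "description"}

def is_well_formed(doc: dict) -> bool:
    """Check every bug entry carries the required fields (sketch only)."""
    return all(REQUIRED_FIELDS <= set(bug) for bug in doc.get("bugs", []))

print(is_well_formed(json.loads(json.dumps(manifest))))  # True
```

In the real pipeline a malformed manifest is repaired by an LLM reformatting step rather than rejected; the check above only illustrates what "malformed" could mean.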

Note: This benchmark tests CLI tools (Claude Code, Codex CLI, Gemini CLI), not the underlying models directly. Each CLI has its own system prompt, tool implementations, and default settings that affect behavior.

Results

150 challenges (3 challengers x 50 repos), 450 reviews, 2,603 injected bugs.

Scoring policy for reported results:

  • matcher consumes raw bugs/*.json finding payloads from reviewer runs
  • matcher --repeat 5 --aggregate median

Weighted bugs found (%):

🟩 Claude: █████████████████████████████ (58.05%)
🟧 Codex:  ███████████████████ (37.84%)
🟦 Gemini: ██████████████ (27.81%)
| Reviewer | Weighted Bugs Found | Unweighted Detection Rate |
| --- | --- | --- |
| Claude | 1,511 / 2,603 (58.05%) | 61.65% |
| Codex | 985 / 2,603 (37.84%) | 43.17% |
| Gemini | 724 / 2,603 (27.81%) | 34.64% |

Why two metrics:

  • Unweighted = mean of per-challenge detection rates (each challenge counts equally).
  • Weighted = global bug-level recall (sum(bugs_found) / sum(total_bugs)).
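Both metrics follow directly from per-challenge counts; a minimal sketch (the numbers below are made up, not benchmark data):

```python
# Per-challenge (bugs_found, total_bugs) pairs; values are illustrative.
challenges = [(10, 20), (3, 4), (0, 6)]

# Weighted = global bug-level recall: sum(bugs_found) / sum(total_bugs).
weighted = sum(f for f, _ in challenges) / sum(t for _, t in challenges)

# Unweighted = mean of per-challenge detection rates (each challenge equal).
rates = [f / t for f, t in challenges]
unweighted = sum(rates) / len(rates)

print(f"{weighted:.4f}")    # 13/30 = 0.4333
print(f"{unweighted:.4f}")  # mean(0.50, 0.75, 0.00) = 0.4167
```

The example shows why the two can diverge: a challenge with many bugs dominates the weighted metric but counts the same as any other challenge in the unweighted one.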

For a concise write-up, see REPORT.md.

Usage

uv sync
uv run cheddar challenge claude slugify      # inject bugs
uv run cheddar review codex slugify -c <id>  # review blind
uv run cheddar match -c <id>                 # score (single run)
uv run cheddar match -c <id> --repeat 3 --aggregate median  # score (median-of-3)

Scoring

An LLM judge (gpt-5.2) performs one global assignment per review: all injected bugs vs. the reviewer's bugs/*.json payloads. The judge is instructed to match only when file, location, and mechanism align, and to leave uncertain bugs unmatched. Each match includes a supporting quote and line number from the reviewer's findings.

To reduce judge stochasticity, scoring uses repeats with median aggregation (--repeat 5 --aggregate median).
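The aggregation itself is straightforward; a sketch of median-of-repeats over one review (scores are illustrative, not real judge output):

```python
from statistics import median

# Five repeated judge runs for the same review; per-run bugs-found
# counts are illustrative.
repeat_scores = [11, 13, 12, 18, 12]

# The median is robust to a single outlier run (the 18 here), which is
# the point of repeating a stochastic judge.
print(median(repeat_scores))  # 12
```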

Scoring uses raw reviewer finding payloads as produced by agents.

Agents

We tested official CLI tools with autonomous permission flags enabled on sandboxed challenge repositories.

| Agent | Model | CLI Flags |
| --- | --- | --- |
| Claude Code | claude-opus-4-6 | --dangerously-skip-permissions --output-format stream-json --verbose |
| Codex CLI | gpt-5.3-codex | exec --dangerously-bypass-approvals-and-sandbox --skip-git-repo-check --json -c model_reasoning_effort='medium' |
| Gemini CLI | gemini-3-pro-preview | --yolo |

Model versions are configured in src/cheddar/agents/config.py; CLI command flags are defined in the per-agent wrappers under src/cheddar/agents/*.py.
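A hedged sketch of how a per-agent wrapper might assemble the CLI invocation from the flags in the table above; the function and mapping names are hypothetical, and the actual wrappers under src/cheddar/agents/*.py may be structured differently:

```python
# Hypothetical wrapper sketch: builds the argv list for each CLI from the
# flags listed in the agents table. Only command construction is shown.
AGENT_COMMANDS = {
    "claude": ["claude", "--dangerously-skip-permissions",
               "--output-format", "stream-json", "--verbose"],
    "codex": ["codex", "exec", "--dangerously-bypass-approvals-and-sandbox",
              "--skip-git-repo-check", "--json",
              "-c", "model_reasoning_effort='medium'"],
    "gemini": ["gemini", "--yolo"],
}

def build_command(agent: str, prompt: str) -> list[str]:
    """Append the task prompt to the agent's base argv (sketch only)."""
    return AGENT_COMMANDS[agent] + [prompt]

print(build_command("gemini", "Review this repo for injected bugs"))
```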

PRs welcome for additional agents.

Dataset

Benchmark corpus: 50 open-source utility libraries across JS/TS/Python/Go/C/Ruby/Rust/Java/C#. Source repos live under repos/ (uv run cheddar list repos).

Reference snapshot: