Show HN: Cheddar-bench – unsupervised benchmark for coding agents

github.com

9 points by przadka 2 months ago · 0 comments


I built a small benchmark to test CLI coding agents on blind bug detection.

A challenger agent injects bugs and writes ground truth (`bugs.json`). A different reviewer agent audits the repo without seeing ground truth, and an LLM matcher scores bug-to-finding assignments.
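A minimal sketch of the matching step, assuming a hypothetical `bugs.json` schema and a matcher that assigns each injected bug to at most one reviewer finding. All field names and the `match` heuristic here are illustrative stand-ins, not the benchmark's actual format or the LLM matcher:

```python
import json

# Hypothetical ground-truth format: one entry per injected bug.
bugs = json.loads("""[
  {"id": "bug-1", "file": "auth.py", "description": "off-by-one in token expiry"},
  {"id": "bug-2", "file": "db.py",   "description": "connection leak on error path"}
]""")

# Hypothetical reviewer output, produced without seeing bugs.json.
findings = [
    {"id": "f-1", "file": "auth.py",
     "description": "token expiry check uses <= instead of <"},
]

def match(bug, finding):
    # Stand-in for the LLM matcher; here, a trivial same-file heuristic.
    return bug["file"] == finding["file"]

# Greedy one-to-one assignment: each finding can credit at most one bug.
detected, used = set(), set()
for bug in bugs:
    for finding in findings:
        if finding["id"] not in used and match(bug, finding):
            detected.add(bug["id"])
            used.add(finding["id"])
            break

print(f"detected {len(detected)}/{len(bugs)} injected bugs")
```

The one-to-one constraint matters: without it, a single vague finding could claim credit for several injected bugs.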

Current run: 50 repos, 150 challenges, 450 reviews, 2,603 injected bugs.

Weighted detection: Claude 58.05%, Codex 37.84%, Gemini 27.81%.
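The post doesn't define the weighting, so as one plausible reading: each bug carries a weight (e.g. by severity), and the score is the weighted fraction of bugs detected. The weights below are made up for illustration:

```python
# Hypothetical per-bug weights; the benchmark's actual scheme may differ.
bugs = [
    {"id": "bug-1", "weight": 3.0, "detected": True},
    {"id": "bug-2", "weight": 1.0, "detected": False},
    {"id": "bug-3", "weight": 2.0, "detected": True},
]

total = sum(b["weight"] for b in bugs)
hit = sum(b["weight"] for b in bugs if b["detected"])
print(f"weighted detection: {hit / total:.2%}")  # 5.0 / 6.0 -> 83.33%
```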

LLM-judge benchmarks are easy to get wrong, so I’d really appreciate critical feedback on benchmark fairness, scoring/matching methodology, and obvious failure modes I’m missing.

Full dataset is linked in the docs.

