I benchmarked 4 coding agents on an NP-hard problem I solved 8 years ago
charlesazam.com
"About 15% of all trials produced completely invalid outputs."
Are you smart enough to pick out the one "AI" solution in seven that's completely invalid? I'm not, at least not in every context. And that figure leaves out the trials that produced sort-of, partially, or kind-of invalid outputs. Can you pick those out? Oh well, I guess "good enough" will have to do. Chabuduo.
I gave an unpublished fiber network optimization problem to Claude Code, Codex, Gemini CLI, and Mistral. The score is total fiber length (lower is better). A good human solution in 30 minutes: ~40,000. My best after days of C++: 34,123. Given one hour, Claude Code hit 34,061 — beating me by 62 points. A 7-word prompt hint improved every agent by 18-30%. About 15% of all trials produced completely invalid outputs.
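To put the quoted figures in proportion, here is a small arithmetic sketch. It uses only the numbers stated in the summary above; the constant and function names are hypothetical, and the 30-minute human score is the approximate ~40,000 quoted there, so the percentages are rough.

```python
# Figures quoted in the summary (score = total fiber length, lower is better).
# All names below are illustrative, not taken from the original benchmark.
HUMAN_30_MIN = 40_000     # approximate "good human solution in 30 minutes"
AUTHOR_BEST = 34_123      # author's best after days of C++
CLAUDE_CODE_1H = 34_061   # Claude Code's score given one hour
INVALID_RATE = 0.15       # "about 15%" of trials were completely invalid


def relative_gain(baseline: int, score: int) -> float:
    """Fraction by which `score` improves on `baseline` (positive = shorter)."""
    return (baseline - score) / baseline


margin = AUTHOR_BEST - CLAUDE_CODE_1H
print(f"margin over author's best: {margin} "
      f"({relative_gain(AUTHOR_BEST, CLAUDE_CODE_1H):.2%})")
print(f"gain over 30-minute human baseline: "
      f"{relative_gain(HUMAN_30_MIN, CLAUDE_CODE_1H):.1%}")
print(f"roughly 1 invalid trial in {1 / INVALID_RATE:.1f}")
```

On these numbers the 62-point margin is a relative improvement of about 0.18% over the author's best, while a 15% invalid rate works out to roughly one trial in seven.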