I benchmarked 4 coding agents on an NP-hard problem I solved 8 years ago

3 points by couAUIA 4 months ago · 2 comments

About 15% of all trials produced completely invalid outputs.

Are you smart enough to pick out the one in seven "AI" solutions that's completely invalid? I'm not, at least not in every context. And that figure leaves out the trials that produced sort-of, partially, or kind-of invalid outputs. Can you pick those out? Oh well, I guess "good enough" will have to do. Chabuduo.

couAUIA (OP) 4 months ago

I gave an unpublished fiber network optimization problem to Claude Code, Codex, Gemini CLI, and Mistral. The score is total fiber length (lower is better). A good human solution in 30 minutes: ~40,000. My best after days of C++: 34,123. Given one hour, Claude Code hit 34,061 — beating me by 62 points. A 7-word prompt hint improved every agent by 18-30%. About 15% of all trials produced completely invalid outputs.
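For scale, the arithmetic behind those figures can be checked with a quick script. This is only a sketch: the constant and function names are illustrative rather than taken from the benchmark, and the ~40,000 human baseline is approximate.

```python
# Scores quoted above (total fiber length, lower is better).
HUMAN_30_MIN = 40_000     # approximate "good human solution in 30 minutes"
AUTHOR_BEST = 34_123      # author's best after days of C++
CLAUDE_CODE_1H = 34_061   # Claude Code, one-hour run
INVALID_RATE = 0.15       # "about 15%" of trials were completely invalid


def relative_gain(baseline: float, score: float) -> float:
    """Fractional reduction in fiber length relative to a baseline score."""
    return (baseline - score) / baseline


gap = AUTHOR_BEST - CLAUDE_CODE_1H
gain_vs_author = relative_gain(AUTHOR_BEST, CLAUDE_CODE_1H)
gain_vs_human = relative_gain(HUMAN_30_MIN, CLAUDE_CODE_1H)
invalid_one_in = 1 / INVALID_RATE

print(f"gap to author's best: {gap} points")
print(f"improvement over author's best: {gain_vs_author:.2%}")
print(f"improvement over 30-minute human baseline: {gain_vs_human:.1%}")
print(f"invalid outputs: roughly 1 trial in {invalid_one_in:.1f}")
```

The 62-point gap works out to well under 1% of the author's best score, while the roughly 15% invalid rate corresponds to about one trial in seven.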
