This benchmark evaluates how well different LLMs perform on system design tasks. Each model receives the same cold system design prompt — no examples, no hints — and produces a complete design with architecture, capacity estimation, tradeoffs, and failure analysis. Independent LLM judges then score every transcript on 5 dimensions.
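I evaluated 10 models on 11 problems with 3 judges: up to 110 transcripts (10 models × 11 problems), each scored by all 3 judges. The Runs column in the table shows how many transcripts were scored for each model. See the methodology.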
Have feedback or a request? Please submit an issue.
| Rank | Model | Mean Score | ±CI | Runs |
|---|---|---|---|---|
| 1 | glm-5.2 | 4.37 | ±0.07 | 2 |
| 2 | kimi-k2.6 | 4.36 | ±0.11 | 11 |
| 3 | gpt-5.4 | 4.32 | ±0.14 | 11 |
| 4 | claude-sonnet-4.6 | 4.26 | ±0.07 | 11 |
| 5 | gpt-oss-120b | 4.02 | ±0.09 | 11 |
| 6 | gemini-3.1-pro | 3.86 | ±0.11 | 11 |
| 7 | deepseek-v4-pro | 3.82 | ±0.32 | 11 |
| 8 | gemma-4-31b-it | 3.38 | ±0.16 | 11 |
| 9 | gpt-oss-20b | 3.36 | ±0.13 | 11 |
| 10 | minimax-m2.7 | 3.33 | ±0.29 | 11 |
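
For readers who want to see how a row like the ones above could be derived, here is a minimal sketch. It is an illustration, not the benchmark's actual scoring code. It assumes each transcript's score is the plain average over all judges and dimensions, and that the ±CI column is a 95% normal-approximation interval over per-run scores. Both are assumptions; the methodology is the authority on the real aggregation. The function names and the sample data are hypothetical.

```python
import math
import statistics


def run_score(judge_scores):
    """Average one transcript's scores across judges and dimensions.

    judge_scores: one list per judge, each holding one score per dimension.
    Assumption: judges and dimensions are weighted equally.
    """
    return statistics.mean(statistics.mean(dims) for dims in judge_scores)


def mean_with_ci(run_scores, z=1.96):
    """Return (mean, half_width) over per-run scores.

    Assumption: a normal-approximation interval with z = 1.96 (about 95%).
    With fewer than two runs the spread is undefined, so half_width is NaN.
    """
    mean = statistics.mean(run_scores)
    if len(run_scores) < 2:
        return mean, float("nan")
    sem = statistics.stdev(run_scores) / math.sqrt(len(run_scores))
    return mean, z * sem


if __name__ == "__main__":
    # Hypothetical data: 2 transcripts, each scored by 3 judges on 5 dimensions.
    transcripts = [
        [[4, 5, 4, 4, 5], [4, 4, 4, 5, 4], [5, 4, 4, 4, 4]],
        [[3, 4, 4, 4, 4], [4, 4, 3, 4, 4], [4, 4, 4, 4, 3]],
    ]
    scores = [run_score(t) for t in transcripts]
    mean, half_width = mean_with_ci(scores)
    print(f"{mean:.2f} ±{half_width:.2f}")
```

With only a handful of runs per model, a Student's t interval would be wider than this normal approximation, so treat small-sample intervals with extra caution.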