LLM System Design Benchmark


This benchmark evaluates how well different LLMs perform on system design tasks. Each model receives the same cold system design prompt — no examples, no hints — and produces a complete design with architecture, capacity estimation, tradeoffs, and failure analysis. Independent LLM judges then score every transcript on 5 dimensions.

I evaluated 10 models on 11 problems, with 3 judges scoring the resulting 110 transcripts. See the methodology.

Feedback or requests? Please submit an issue.

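| Rank | Model | Mean Score | ±CI | Runs |
|------|-------|------------|-----|------|
| 1 | glm-5.2 | 4.37 | ±0.07 | 2 |
| 2 | kimi-k2.6 | 4.36 | ±0.11 | 11 |
| 3 | gpt-5.4 | 4.32 | ±0.14 | 11 |
| 4 | claude-sonnet-4.6 | 4.26 | ±0.07 | 11 |
| 5 | gpt-oss-120b | 4.02 | ±0.09 | 11 |
| 6 | gemini-3.1-pro | 3.86 | ±0.11 | 11 |
| 7 | deepseek-v4-pro | 3.82 | ±0.32 | 11 |
| 8 | gemma-4-31b-it | 3.38 | ±0.16 | 11 |
| 9 | gpt-oss-20b | 3.36 | ±0.13 | 11 |
| 10 | minimax-m2.7 | 3.33 | ±0.29 | 11 |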
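
The Mean Score and ±CI columns summarize each model's per-transcript scores. This page does not say how the interval is computed, so the following is only a minimal Python sketch under stated assumptions: each judge gives one numeric score per dimension, a transcript's score is the unweighted mean over all judges and dimensions, and ±CI is a 95% normal-approximation interval over a model's transcript scores. The function names and sample numbers are hypothetical and are not taken from the benchmark.

```python
import math
import statistics


def transcript_score(judge_scores):
    """Average one transcript's scores over all judges and dimensions.

    judge_scores is a list with one inner list per judge, each holding
    that judge's per-dimension scores (assumed equally weighted).
    """
    flat = [score for dims in judge_scores for score in dims]
    return statistics.fmean(flat)


def mean_with_ci(scores, z=1.96):
    """Return (mean, half_width) using a normal-approximation interval.

    z=1.96 corresponds to a 95% interval; this is an assumption, since
    the benchmark's actual CI method is not stated on the page.
    """
    mean = statistics.fmean(scores)
    if len(scores) < 2:
        # A single run gives no variance estimate, so no interval.
        return mean, math.nan
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, z * sem


# Hypothetical data: 3 transcripts for one model, each scored by
# 3 judges on 5 dimensions.
runs = [
    [[4, 5, 4, 4, 3], [5, 5, 4, 4, 4], [4, 4, 4, 3, 4]],
    [[4, 4, 5, 4, 4], [4, 5, 5, 4, 4], [5, 4, 4, 4, 4]],
    [[3, 4, 4, 4, 4], [4, 4, 4, 5, 4], [4, 4, 3, 4, 4]],
]
per_transcript = [transcript_score(r) for r in runs]
mean, half_width = mean_with_ci(per_transcript)
print(f"{mean:.2f} ±{half_width:.2f} over {len(per_transcript)} runs")
```

With few runs per model, a t-distribution critical value would give a wider and arguably more honest interval than the fixed z used here; the sketch keeps z only to stay within the standard library.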
