LLM System Design Benchmark

This benchmark evaluates how well different LLMs perform on system design tasks. Each model receives the same cold system design prompt, with no examples and no hints, and produces a complete design covering architecture, capacity estimation, tradeoffs, and failure analysis. Independent LLM judges then score every transcript on five dimensions.

I evaluated 10 models on 11 problems with 3 judges — 110 transcripts scored in total. See the methodology.
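As a rough illustration of how several judges' per-dimension scores could be collapsed into one transcript score, here is a minimal sketch. It assumes an unweighted average over dimensions and then over judges; the actual aggregation rule is described in the methodology and may differ. The function name `transcript_score` and all numbers are hypothetical.

```python
from statistics import mean


def transcript_score(judge_scores):
    """Collapse per-judge, per-dimension scores into one number.

    judge_scores: one inner list per judge, one score per dimension.
    Assumption: every judge and every dimension is weighted equally.
    """
    per_judge = [mean(dims) for dims in judge_scores]
    return mean(per_judge)


# Hypothetical scores from 3 judges on 5 dimensions (made-up numbers).
example = [
    [4.0, 5.0, 4.0, 3.0, 4.0],
    [4.0, 4.0, 4.0, 4.0, 4.0],
    [5.0, 4.0, 3.0, 4.0, 4.0],
]
score = transcript_score(example)
print(f"transcript score: {score:.2f}")
```

Averaging within each judge first keeps a single judge from dominating if judges ever score a different number of dimensions; with equal counts it gives the same result as one flat average.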

Feedback or requests? Please open an issue.

Rank	Model	Mean Score	±CI	Runs
1	glm-5.2	4.37	±0.07	2
2	kimi-k2.6	4.36	±0.11	11
3	gpt-5.4	4.32	±0.14	11
4	claude-sonnet-4.6	4.26	±0.07	11
5	gpt-oss-120b	4.02	±0.09	11
6	gemini-3.1-pro	3.86	±0.11	11
7	deepseek-v4-pro	3.82	±0.32	11
8	gemma-4-31b-it	3.38	±0.16	11
9	gpt-oss-20b	3.36	±0.13	11
10	minimax-m2.7	3.33	±0.29	11
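For readers who want to reproduce columns like Mean Score and ±CI from their own runs, the sketch below shows one common approach: the sample mean plus a normal-approximation 95% interval half-width. This is an assumption, since the table does not state how its interval is computed; the function name `mean_with_ci` and the sample scores are hypothetical.

```python
import math
from statistics import mean, stdev


def mean_with_ci(scores, z=1.96):
    """Return (mean, half_width) for a list of per-run scores.

    Assumption: normal approximation with z = 1.96 for roughly 95% coverage.
    With few runs, a Student-t critical value would give a wider,
    more honest interval.
    """
    m = mean(scores)
    if len(scores) < 2:
        # A single run has no sample variance, so no interval is defined.
        return m, float("nan")
    half_width = z * stdev(scores) / math.sqrt(len(scores))
    return m, half_width


# Hypothetical per-run scores for one model (made-up numbers).
runs = [4.0, 4.5, 4.0, 4.5]
m, half = mean_with_ci(runs)
print(f"{m:.2f} ±{half:.2f} over {len(runs)} runs")
```

Because the half-width shrinks with the square root of the number of runs, a model with only a couple of runs can show a narrow interval by chance, so rankings between closely scored models are best read with the Runs column in mind.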
