HWE Bench · RISC-V CPU design benchmark for LLMs

RISC-V · RV32IM · single-issue · FPGA-grounded

An unbounded benchmark for LLM hardware engineering. Large language models design RISC-V CPUs from scratch. Every design must first pass a full battery of formal correctness proofs, so buggy CPUs are thrown out. The ones that survive are then scored by how fast they would actually run on a physical FPGA.

Thesis: SWE-bench tops out at 100%. HWE Bench doesn't have a top.
The fitness number reflects an actual microarchitecture, and microarchitecture has room to grow as long as models keep finding it.

Speed vs size

Score × Area

[Figure: scatter plot. Vertical axis: Fitness (CoreMark iter/s), 300 to 500. Horizontal axis: Area · LUT4 count (smaller is better), 5.0k to 10.0k. Labeled points: gpt-5_5_xhigh 525 · 5.5k LUT; gpt-5_4_xhigh 514 · 10.1k LUT; gpt-5_5_high 462 · 9.8k LUT; gpt-5_5_medium 432 · 7.8k LUT; kimi-k2_6 396 · 9.9k LUT; gemini-3_1-pro 355 · 10.2k LUT; VexRiscv human ref 370 · 3.4k LUT; baseline V0 283 · 9.6k LUT.]
Vertical axis: CoreMark fitness (how fast the CPU runs the benchmark). Horizontal axis: chip area as a LUT4 count, that is, how many 4-input lookup tables the design occupies on the FPGA. One point per model's best run. VexRiscv (3,957 LUT4 · fitness 370) is the human-engineered reference. Up and to the left is the goal: a faster chip that is also a smaller chip.
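One way to make "up and to the left" concrete is a Pareto comparison: a design is dominated when some other design is at least as fast and at least as small, and strictly better on one of the two axes. The sketch below is illustrative only. The function names `dominates` and `pareto_front` are hypothetical, and the inputs are the rounded fitness and area figures from the leaderboard rather than the benchmark's raw data.

```python
# (name, fitness in CoreMark iter/s, area in LUT4), rounded as on the leaderboard
designs = [
    ("gpt-5_5_xhigh", 525, 5500),
    ("gpt-5_4_xhigh", 514, 10100),
    ("gpt-5_5_high", 462, 9800),
    ("gpt-5_5_medium", 432, 7800),
    ("kimi-k2_6", 396, 9900),
    ("gemini-3_1-pro", 355, 10200),
    ("VexRiscv", 370, 3400),
    ("baseline V0", 283, 9600),
]


def dominates(a, b):
    """True if a is at least as fast and as small as b, and strictly better on one axis."""
    _, fit_a, area_a = a
    _, fit_b, area_b = b
    return fit_a >= fit_b and area_a <= area_b and (fit_a > fit_b or area_a < area_b)


def pareto_front(points):
    """Names of the designs that no other design dominates, in input order."""
    return [p[0] for p in points if not any(dominates(q, p) for q in points)]


front = pareto_front(designs)
print(front)
```

On these rounded figures only two points remain undominated: the fastest design and the smallest one.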

Leaderboard

Peak fitness per model

Best of N=3 reps per model · 17 reps total · VexRiscv human reference in red · baseline V0 in italic
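| # | Model | Reps | Best (iter/s) | Δ vs V0 | Mean ± std (iter/s) | Area (LUT4) | Fmax (MHz) |
|---|-------|------|---------------|---------|---------------------|-------------|------------|
| 1 | gpt-5_5_xhigh | 3/3 | 525.04 | +85.6% | 468.3 ± 52.8 | 5.5k | 220 |
| 2 | gpt-5_4_xhigh | 2/2 | 513.84 | +81.7% | 505.0 ± 8.9 | 10.1k | 203 |
| 3 | gpt-5_5_high | 3/3 | 461.87 | +63.3% | 430.2 ± 23.0 | 9.8k | 187 |
| 4 | gpt-5_5_medium | 3/3 | 431.58 | +52.6% | 423.5 ± 11.2 | 7.8k | 201 |
| 5 | kimi-k2_6 | 2/3 | 396.13 | +40.1% | 339.5 ± 8.3 | 9.9k | 166 |
| 6 | VexRiscv (human ref) | n/a | 370.00 | +30.8% | n/a | 3.4k | 144 |
| 7 | gemini-3_1-pro | 3/3 | 354.73 | +25.4% | 339.4 ± 12.6 | 10.2k | 150 |
| 8 | baseline V0 (fixture) | n/a | 282.82 | n/a | n/a | 9.6k | 127 |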

The VexRiscv row is the human-engineered reference: a well-known open-source RV32IM CPU, synthesized on the same FPGA used for the benchmark. Five of the six models produced a best design that beats it on fitness, though none matches its area. See the methodology page for the full procedure.
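The Δ column can be reproduced from the Best column alone, as the percentage gain over the baseline V0 fitness of 282.82. A minimal sketch, with `delta_pct` as a hypothetical helper name and the values copied from the leaderboard:

```python
BASELINE_V0 = 282.82   # baseline V0 fitness, CoreMark iter/s
VEXRISCV_REF = 370.00  # human reference fitness, CoreMark iter/s

# Best fitness per model, copied from the leaderboard
llm_best = {
    "gpt-5_5_xhigh": 525.04,
    "gpt-5_4_xhigh": 513.84,
    "gpt-5_5_high": 461.87,
    "gpt-5_5_medium": 431.58,
    "kimi-k2_6": 396.13,
    "gemini-3_1-pro": 354.73,
}


def delta_pct(fitness, baseline=BASELINE_V0):
    """Percentage gain over the baseline, rounded to one decimal place."""
    return round((fitness / baseline - 1.0) * 100.0, 1)


deltas = {model: delta_pct(fit) for model, fit in llm_best.items()}
beats_reference = [model for model, fit in llm_best.items() if fit > VEXRISCV_REF]

for model, pct in deltas.items():
    print(f"{model:16s} {pct:+.1f}%")
print(f"{len(beats_reference)} of {len(llm_best)} models beat the human reference")
```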

Why unbounded

SWE-bench saturates. HWE Bench doesn't.

Most LLM benchmarks have a fixed ceiling. SWE-bench tops out at 100% issue-resolution. Multiple-choice evals approach 99%. Once a model lands at the ceiling, every subsequent model gets the same score, and the benchmark stops being useful for tracking capability.

HWE Bench has no ceiling. Fitness is the CPU's speed running CoreMark on a real FPGA: operating frequency times instructions per cycle (Fmax × IPC, for the technically inclined). There's no theoretical maximum: a smarter microarchitecture always scores higher. As long as models keep finding new tricks (deeper pipelines, smarter branch predictors, restructured ALUs), the leaderboard keeps moving.
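Since fitness is described as Fmax × IPC, dividing a design's fitness by its Fmax gives a per-clock figure (CoreMark iter/s per MHz) that tracks IPC, provided the instruction count per CoreMark iteration is about the same for every design. That proviso is an assumption made here, not something the benchmark states. The helper names are hypothetical too, and the Fmax values are the rounded leaderboard figures, assumed to belong to each row's best rep. With those caveats, a speedup can be split into a frequency part and a per-clock part:

```python
# (best fitness in CoreMark iter/s, Fmax in MHz), copied from the leaderboard
rows = {
    "gpt-5_5_xhigh": (525.04, 220),
    "VexRiscv": (370.00, 144),
    "baseline V0": (282.82, 127),
}


def per_mhz(fitness, fmax_mhz):
    """CoreMark iter/s per MHz: a per-clock figure, used here as a proxy for IPC."""
    return fitness / fmax_mhz


def split_speedup(design, reference):
    """Return (frequency gain, per-clock gain); their product is the fitness ratio."""
    fit_d, mhz_d = rows[design]
    fit_r, mhz_r = rows[reference]
    freq_gain = mhz_d / mhz_r
    per_clock_gain = per_mhz(fit_d, mhz_d) / per_mhz(fit_r, mhz_r)
    return freq_gain, per_clock_gain


freq_gain, per_clock_gain = split_speedup("gpt-5_5_xhigh", "baseline V0")
total_gain = freq_gain * per_clock_gain
print(f"frequency x{freq_gain:.2f}, per-clock x{per_clock_gain:.2f}, total x{total_gain:.2f}")
```

Under these assumptions, most of the top design's gain over the baseline would come from a higher clock rather than from more work per cycle.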

Empirically: the current best is 525.04 iter/s, which is +85.6% over the V0 baseline core and about 42% above the VexRiscv human reference (370.00 iter/s). Within current budgets the curve has not saturated.

Trajectory

Fitness over rounds, best rep per model

[Figure: step chart. Vertical axis: best fitness so far, 300 to 500. Horizontal axis: round, R0 to R15 (1 hypothesis × 3 slots each). Reference lines: baseline 283, VexRiscv 370. Values at R15: gpt-5_5_xhigh 525; gpt-5_4_xhigh 514; gpt-5_5_high 462; gpt-5_5_medium 432; kimi-k2_6 396; gemini-3_1-pro 355.]
Running max of CoreMark fitness across the 15 hypothesis rounds for each model's best-performing rep. Lines step up when a winning hypothesis lands and stay flat otherwise. VexRiscv's human-reference fitness is the red dashed line; the baseline V0 core is the gray dashed line.
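The running max described above can be sketched as follows. The per-round scores are invented for illustration and are not benchmark data. The sketch also assumes that R0 is the baseline V0 fitness and that a round with no surviving design (shown as `None`) leaves the line flat.

```python
def running_max(round_scores, baseline):
    """Best fitness so far after each round, starting from the baseline at R0."""
    best = baseline
    trajectory = [best]  # R0
    for score in round_scores:
        # A round only moves the line when it produces a new best score.
        if score is not None and score > best:
            best = score
        trajectory.append(best)
    return trajectory


# Hypothetical per-round fitness values; None marks a round with no valid design
rounds = [290.0, None, 275.5, 310.2, 310.2, None, 340.0]
trajectory = running_max(rounds, baseline=282.82)
print(trajectory)
```

The result never decreases, which is why each line in the chart is a staircase.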