RISC-V · RV32IM · single-issue · FPGA-grounded
An unbounded benchmark for LLM hardware engineering. Large language models design RISC-V CPUs from scratch. Every design must first pass a full battery of formal correctness proofs, so buggy CPUs are thrown out. The ones that survive are then scored by how fast they would actually run on a physical FPGA.
Thesis
SWE-bench tops out at 100%. HWE Bench doesn't have a top.
The fitness number reflects an actual microarchitecture, and microarchitecture
has room to grow as long as models keep finding improvements.
Speed vs size
[Figure: Score × Area]
Leaderboard
Peak fitness per model
| # | Model | Reps | Best (iter/s) | Δ% vs. V0 | Mean ± std (iter/s) | Area (LUT4) | Fmax (MHz) |
|---|---|---|---|---|---|---|---|
| 1 | gpt-5_5_xhigh | 3/3 | 525.04 | +85.6% | 468.3 ± 52.8 | 5.5k | 220 |
| 2 | gpt-5_4_xhigh | 2/2 | 513.84 | +81.7% | 505.0 ± 8.9 | 10.1k | 203 |
| 3 | gpt-5_5_high | 3/3 | 461.87 | +63.3% | 430.2 ± 23.0 | 9.8k | 187 |
| 4 | gpt-5_5_medium | 3/3 | 431.58 | +52.6% | 423.5 ± 11.2 | 7.8k | 201 |
| 5 | kimi-k2_6 | 2/3 | 396.13 | +40.1% | 339.5 ± 8.3 | 9.9k | 166 |
| 6 | VexRiscv (human ref) | n/a | 370.00 | +30.8% | n/a | 3.4k | 144 |
| 7 | gemini-3_1-pro | 3/3 | 354.73 | +25.4% | 339.4 ± 12.6 | 10.2k | 150 |
| 8 | baseline V0 (fixture) | n/a | 282.82 | n/a | n/a | 9.6k | 127 |
The VexRiscv row is the human-engineered reference: a well-known open-source RV32IM CPU, synthesized on the same FPGA used for the benchmark. Five of the LLM-generated designs beat it. See the methodology page for the full procedure.
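The Δ% column can be re-derived from the Best column. The snippet below is a minimal sketch, assuming Δ% is each row's best score relative to the baseline V0 score. The names `BEST`, `delta_pct`, and `beating_ref` are illustrative and not part of the benchmark.

```python
# Best fitness per row in iter/s, copied from the leaderboard table.
BEST = {
    "gpt-5_5_xhigh": 525.04,
    "gpt-5_4_xhigh": 513.84,
    "gpt-5_5_high": 461.87,
    "gpt-5_5_medium": 431.58,
    "kimi-k2_6": 396.13,
    "VexRiscv (human ref)": 370.00,
    "gemini-3_1-pro": 354.73,
    "baseline V0 (fixture)": 282.82,
}

BASELINE = BEST["baseline V0 (fixture)"]
HUMAN_REF = BEST["VexRiscv (human ref)"]


def delta_pct(score, baseline=BASELINE):
    """Relative gain over the baseline in percent (assumed definition of Δ%)."""
    return (score / baseline - 1.0) * 100.0


# Δ% for every row, rounded to one decimal like the table.
deltas = {name: round(delta_pct(score), 1) for name, score in BEST.items()}

# Rows whose best score is strictly above the human reference.
beating_ref = [name for name, score in BEST.items() if score > HUMAN_REF]

for name, pct in deltas.items():
    print(f"{name:24s} {pct:+.1f}%")
print("designs above VexRiscv:", len(beating_ref))
```

Under that assumption, the recomputed percentages should agree with the table to one decimal place, and the count of rows above the reference should match the figure quoted in the text.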
Why unbounded
SWE-bench saturates. HWE Bench doesn't.
Most LLM benchmarks have a fixed ceiling. SWE-bench tops out at 100% issue-resolution. Multiple-choice evals approach 99%. Once a model lands at the ceiling, every subsequent model gets the same score, and the benchmark stops being useful for tracking capability.
HWE Bench has no ceiling. Fitness is the CPU's actual speed running CoreMark on a real FPGA: operating frequency times instructions per cycle (Fmax × IPC, for the technically inclined). There's no theoretical maximum: a smarter microarchitecture always scores higher. As long as models keep finding new tricks (deeper pipelines, smarter branch predictors, restructured ALUs), the leaderboard keeps moving.
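To make the Fmax × IPC split concrete, the sketch below breaks one design's gain over the baseline into a frequency factor and a per-cycle factor. It rests on two assumptions that the page does not confirm: that the Fmax column belongs to the same run as the Best score, and that every design executes roughly the same number of instructions per CoreMark iteration, so that iterations per MHz is proportional to IPC. The helper names are hypothetical.

```python
# (best fitness in iter/s, Fmax in MHz), copied from the leaderboard table.
ROWS = {
    "gpt-5_5_xhigh": (525.04, 220),
    "VexRiscv (human ref)": (370.00, 144),
    "baseline V0 (fixture)": (282.82, 127),
}


def iters_per_mhz(fitness_iter_s, fmax_mhz):
    """Per-cycle efficiency proxy: CoreMark iterations per second per MHz.

    Proportional to IPC only if the instruction count per iteration is the
    same for all designs (an assumption, see the lead-in).
    """
    return fitness_iter_s / fmax_mhz


def decompose(design, reference="baseline V0 (fixture)"):
    """Split the speedup over `reference` into frequency and per-cycle factors."""
    fit_d, fmax_d = ROWS[design]
    fit_r, fmax_r = ROWS[reference]
    freq_gain = fmax_d / fmax_r
    cycle_gain = iters_per_mhz(fit_d, fmax_d) / iters_per_mhz(fit_r, fmax_r)
    total_gain = fit_d / fit_r  # equals freq_gain * cycle_gain by construction
    return freq_gain, cycle_gain, total_gain


freq_gain, cycle_gain, total_gain = decompose("gpt-5_5_xhigh")
print(f"frequency x{freq_gain:.3f}, per-cycle x{cycle_gain:.3f}, total x{total_gain:.3f}")
```

Read this way, a score can rise through a faster clock, through more work per cycle, or both, and the two factors multiply to give the overall speedup.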
Empirically, the current best is 525.04 iter/s: +85.6% over the V0 baseline core, and well clear of the VexRiscv human reference at 370.00 iter/s. Within current budgets, the curve has not saturated.
Trajectory