MillenniumPrizeProblemBench is designed to track how quickly frontier models close the gap between today’s capabilities and the level of reasoning suggested by the Millennium Prize Problems.
MillenniumPrizeProblemBench
Experimental research benchmark. Scores are preliminary and based on internal evaluations.
Stress-testing frontier AI on the hardest math we know. A benchmark that measures model performance on structured problem-solving pipelines by tracking progress on tasks inspired by the seven Millennium Prize Problems: proof search, conjecture generation, formal verification, and research-grade reasoning.
6 frontier models — GPT, Claude, Gemini, Llama, Mistral…
7 tracks — one for each Millennium Prize Problem.
Pass / Fail only — no partial credit on any track.
Status: no current model passes a single Millennium-inspired track.
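As a rough illustration of the scoring structure only, the leaderboard reduces to a models-by-tracks grid of boolean pass/fail results. The identifiers in this sketch are placeholders, not the benchmark's actual names.

    # Illustrative sketch of the models-by-tracks pass/fail grid.
    # Track and model identifiers are placeholders, not official benchmark names.
    TRACKS = [
        "p_vs_np", "riemann", "yang_mills_mass_gap", "navier_stokes",
        "birch_swinnerton_dyer", "hodge", "topological_surrogates",
    ]
    MODELS = ["gpt", "claude", "gemini", "llama", "mistral"]  # sixth model elided in the summary above

    # Current status: every cell is a fail; no partial credit exists.
    leaderboard = {model: {track: False for track in TRACKS} for model in MODELS}
    assert not any(passed for row in leaderboard.values() for passed in row.values())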
P vs NP
Problem: Decide whether every problem whose solution can be verified quickly (NP) can also be solved quickly (P), or give a proof that P ≠ NP.
Benchmark: tasks center on structured reductions, proof sketches, and complexity reasoning without claiming to resolve P vs NP.
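For reference, and separate from the benchmark tasks themselves, the underlying question can be phrased as an equality between two classes of decision problems:

    % One standard phrasing of the P vs NP question (background, not a task)
    \mathrm{P}  = \{\, L \mid L \text{ is decidable in time } O(n^{k}) \text{ for some constant } k \,\}
    \mathrm{NP} = \{\, L \mid \exists\, \text{polynomial-time verifier } V:\ x \in L \iff \exists\, w,\ |w| \le \mathrm{poly}(|x|),\ V(x, w) = 1 \,\}
    \text{Open question:}\quad \mathrm{P} \overset{?}{=} \mathrm{NP}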
Riemann Hypothesis
Problem: Show that all nontrivial zeros of the Riemann zeta function lie on the critical line Re(s) = 1/2.
Benchmark: synthetic tasks in analytic number theory, conjecture mining, and reasoning about zero distributions & L-functions.
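As background only, the hypothesis concerns the zeta function, defined by a Dirichlet series for Re(s) > 1 and extended by analytic continuation:

    % Riemann zeta function and the hypothesis (background statement)
    \zeta(s) = \sum_{n=1}^{\infty} n^{-s}, \qquad \operatorname{Re}(s) > 1,
    \text{continued analytically to } \mathbb{C} \setminus \{1\}.
    \text{Riemann Hypothesis:}\quad \zeta(s) = 0,\ 0 < \operatorname{Re}(s) < 1 \ \Longrightarrow\ \operatorname{Re}(s) = \tfrac{1}{2}.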
Yang–Mills / Mass Gap
Problem: Construct a quantum Yang–Mills theory on four-dimensional spacetime and prove the existence of a positive mass gap.
Benchmark: PDE and field-theory surrogates that test reasoning about gauge symmetries, energy bounds, and toy mass-gap arguments.
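Schematically, and leaving aside the precise axiomatic formulation in the official problem statement, the mass-gap condition asks that the spectrum of the theory's Hamiltonian H be separated from zero by some Δ > 0:

    % Schematic mass-gap condition; H denotes the Hamiltonian of the
    % (yet-to-be-constructed) 4D quantum Yang–Mills theory, \Delta the gap.
    \operatorname{spec}(H) \subset \{0\} \cup [\Delta, \infty), \qquad \Delta > 0.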
Navier–Stokes
Problem: Prove or disprove global existence and smoothness for solutions of the 3D incompressible Navier–Stokes equations with smooth initial data.
Benchmark: toy fluid-dynamics PDE tasks about blow-up, regularity heuristics, and simplified existence arguments.
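For context, the system in question is the 3D incompressible Navier–Stokes equations; here u is the velocity field, p the pressure, ν > 0 the viscosity, and f a forcing term:

    % 3D incompressible Navier–Stokes equations (background statement)
    \partial_t u + (u \cdot \nabla) u = -\nabla p + \nu \Delta u + f, \qquad \nabla \cdot u = 0,
    u(x, 0) = u_0(x), \quad x \in \mathbb{R}^{3}.
    \text{Question: does every smooth, divergence-free } u_0 \text{ yield a globally smooth solution } (u, p)?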
Birch & Swinnerton-Dyer
Problem: Relate the arithmetic of an elliptic curve (its rank) to the order of vanishing of its L-function at s = 1.
Benchmark: tasks over elliptic curves, rational points, and L-function heuristics that mirror some of the structure of BSD.
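The rank statement of the conjecture, given here as background for the track rather than as a task, ties the Mordell–Weil rank of an elliptic curve over the rationals to its L-function:

    % Birch and Swinnerton-Dyer conjecture, rank statement (background)
    \operatorname{rank} E(\mathbb{Q}) = \operatorname{ord}_{s=1} L(E, s)
    \text{for an elliptic curve } E \text{ defined over } \mathbb{Q}.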
Hodge Conjecture
Problem: Determine whether certain cohomology classes on projective algebraic varieties are algebraic cycles.
Benchmark: synthetic tasks in cohomology, curvature, and geometry intuition designed to echo the flavor of Hodge-theoretic arguments.
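As background, for a smooth projective variety X over the complex numbers the conjecture concerns the rational classes of type (k, k):

    % Hodge conjecture (background statement for smooth projective X over \mathbb{C})
    \operatorname{Hdg}^{k}(X) := H^{2k}(X, \mathbb{Q}) \cap H^{k,k}(X)
    \text{Conjecture: every class in } \operatorname{Hdg}^{k}(X) \text{ is a } \mathbb{Q}\text{-linear combination of classes of algebraic cycles.}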
Topological Surrogates
Problem: inspired by the (now resolved) Poincaré conjecture on three-dimensional manifolds, this track serves as a stand-in for deep open problems in topology.
Benchmark: toy 3-manifold and homotopy-style tasks that stress high-level geometric and topological reasoning.
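The motivating (and now proved) statement, included only to indicate the flavor of this track:

    % Poincaré conjecture, resolved by Perelman (motivation for this track)
    M \text{ closed, simply connected, } \dim M = 3 \ \Longrightarrow\ M \cong S^{3}
    % i.e. every such 3-manifold is homeomorphic to the 3-sphere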
Methodology: results indicate whether models consistently pass or fail synthetic task suites that mimic structural aspects of the Millennium Prize Problems (e.g. formal proof steps, conjecture search, counterexample discovery, and self-critique), evaluated via automated checkers and human expert review. No model here is claimed to have solved any genuine Millennium Prize Problem.
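As a minimal sketch of how pass/fail aggregation could work under these rules, assuming hypothetical task records that carry an automated-checker verdict and an expert-review flag (the names below are illustrative, not the benchmark's actual API):

    # Minimal sketch of a pass/fail aggregation harness.
    # TaskResult and track_passes are hypothetical names for illustration only.
    from dataclasses import dataclass


    @dataclass
    class TaskResult:
        track: str             # e.g. "riemann", "navier_stokes"
        model: str             # e.g. "gpt", "claude"
        checker_passed: bool   # automated checker verdict (proof steps, counterexamples, ...)
        expert_approved: bool  # human expert review verdict


    def track_passes(results: list[TaskResult], track: str, model: str) -> bool:
        """A track is passed only if every task in it clears both the automated
        checker and expert review: no partial credit, mirroring the benchmark's
        pass/fail-only scoring."""
        relevant = [r for r in results if r.track == track and r.model == model]
        return bool(relevant) and all(r.checker_passed and r.expert_approved for r in relevant)


    if __name__ == "__main__":
        demo = [
            TaskResult("p_vs_np", "gpt", checker_passed=True, expert_approved=False),
            TaskResult("p_vs_np", "gpt", checker_passed=True, expert_approved=True),
        ]
        # One task failed expert review, so the whole track is a fail.
        print(track_passes(demo, "p_vs_np", "gpt"))  # False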
Future model performance
While current models fail every MillenniumPrizeProblemBench track, recent benchmark history suggests that performance can improve rapidly once a capability becomes an optimization target. It is plausible that future systems will eventually achieve non-trivial pass rates on synthetic tasks that mirror aspects of the Millennium Problems. Passing multiple tracks would indicate strong performance on closed-ended, verifiable mathematical reasoning, but it would not by itself imply autonomous research capabilities or “artificial general intelligence”. MillenniumPrizeProblemBench focuses on structured proof-style problems rather than open-ended research or creative discovery, making it a targeted measure of technical reasoning under strict verification.
Impact
By providing a clear, pass-or-fail view of progress on Millennium-inspired tasks, MillenniumPrizeProblemBench offers a common reference point for researchers, labs, and policymakers when assessing model capabilities. This can support more grounded discussions about development trajectories, potential risks, and appropriate governance measures. Even if no model comes close to resolving the true Millennium Problems, tracking performance on structurally similar benchmarks helps clarify where today’s systems excel, where they still break, and which kinds of mathematical reasoning remain firmly out of reach.