MillenniumPrizeProblemBench


Experimental research benchmark. Scores are preliminary and based on internal evaluations.


Stress-testing frontier AI on the hardest math we know.

A benchmark that measures model performance on structured problem-solving pipelines by tracking their progress on tasks derived from the seven Millennium Prize Problems, across four components: proof search, conjecture generation, formal verification, and research-grade reasoning.

6 frontier models — GPT, Claude, Gemini, Llama, Mistral…

7 tracks — one for each Millennium Prize Problem.

Pass / Fail only — no partial credit on any track.

Status: no current model passes a single Millennium-inspired track.
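To make the structure above concrete, here is a minimal sketch of what a single task record in such a pipeline could look like. The class and field names (Stage, MillenniumTask, checker) are illustrative assumptions, not a schema published by the benchmark.

```python
# Hypothetical sketch of one task record, mirroring the pipeline components
# described above (proof search, conjecture generation, formal verification,
# research-grade reasoning). Names are illustrative, not from a real schema.
from dataclasses import dataclass
from enum import Enum

class Stage(Enum):
    PROOF_SEARCH = "proof_search"
    CONJECTURE_GENERATION = "conjecture_generation"
    FORMAL_VERIFICATION = "formal_verification"
    RESEARCH_REASONING = "research_reasoning"

@dataclass
class MillenniumTask:
    track: str     # e.g. "riemann_hypothesis" (one of the seven tracks)
    stage: Stage   # which pipeline component the task exercises
    prompt: str    # problem statement shown to the model
    checker: str   # identifier of the automated checker used for grading
```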

P vs NP

Problem: Decide whether every problem whose solution can be verified quickly (NP) can also be solved quickly (P); that is, prove either P = NP or P ≠ NP.
Benchmark: tasks center on structured reductions, proof sketches, and complexity reasoning without claiming to resolve P vs NP.
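For orientation only (these definitions are standard and not drawn from the benchmark's task set), the underlying question can be stated as:

```latex
% P: languages decidable in deterministic polynomial time.
% NP: languages with polynomial-time verifiable certificates.
\mathrm{P} \subseteq \mathrm{NP}, \qquad \text{open question: } \mathrm{P} \stackrel{?}{=} \mathrm{NP}
% Equivalently: does every language with a polynomial-time verifier
% also admit a polynomial-time decision procedure?
```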

Riemann Hypothesis


Problem: Show that all nontrivial zeros of the Riemann zeta function lie on the critical line Re(s) = 1/2.
Benchmark: synthetic tasks in analytic number theory, conjecture mining, and reasoning about zero distributions & L-functions.
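For context, the standard statement the track echoes (not itself a benchmark task) can be written as:

```latex
% Riemann zeta function, defined for Re(s) > 1 and continued analytically to C \ {1}.
\zeta(s) = \sum_{n=1}^{\infty} n^{-s}
         = \prod_{p\ \mathrm{prime}} \left(1 - p^{-s}\right)^{-1},
  \qquad \operatorname{Re}(s) > 1
% Riemann Hypothesis: every nontrivial zero \rho of \zeta satisfies
\operatorname{Re}(\rho) = \tfrac{1}{2}
```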

Yang–Mills / Mass Gap


Problem: Construct a quantum Yang–Mills theory on four-dimensional spacetime and prove the existence of a positive mass gap.
Benchmark: PDE and field-theory surrogates that test reasoning about gauge symmetries, energy bounds, and toy mass-gap arguments.
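As a rough reference (sign and normalization conventions vary, and this is not part of the benchmark itself), the Euclidean Yang–Mills action and the mass-gap condition are usually written as:

```latex
% Yang--Mills action for a gauge field A with curvature F_{\mu\nu} and coupling g.
S[A] = \frac{1}{4g^{2}} \int_{\mathbb{R}^{4}} \operatorname{tr}\left( F_{\mu\nu} F^{\mu\nu} \right) \, d^{4}x
% Mass gap: the Hamiltonian H of the quantized theory has spectrum
\operatorname{spec}(H) \subseteq \{0\} \cup [\Delta, \infty) \quad \text{for some } \Delta > 0
```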

Navier–Stokes


Problem: Prove or disprove global existence and smoothness for solutions of the 3D incompressible Navier–Stokes equations with smooth initial data.
Benchmark: toy fluid-dynamics PDE tasks about blow-up, regularity heuristics, and simplified existence arguments.
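For reference, the 3D incompressible Navier–Stokes system behind the problem statement (standard form, not a task in the suite) is:

```latex
% Velocity field u(x,t), pressure p(x,t), kinematic viscosity \nu > 0, on R^3.
\partial_t u + (u \cdot \nabla)\, u = -\nabla p + \nu\, \Delta u,
  \qquad \nabla \cdot u = 0, \qquad u(\cdot, 0) = u_0
% Question: for smooth, divergence-free, suitably decaying u_0, do smooth
% solutions exist for all t > 0, or can they blow up in finite time?
```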

Birch & Swinnerton-Dyer


Problem: Relate the arithmetic of an elliptic curve (its rank) to the order of vanishing of its L-function at s = 1.
Benchmark: tasks over elliptic curves, rational points, and L-function heuristics that mirror some of the structure of BSD.
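The rank statement of the conjecture that these tasks echo (standard formulation, included only for orientation) is:

```latex
% For an elliptic curve E over Q with Hasse--Weil L-function L(E, s):
\operatorname{rank} E(\mathbb{Q}) \;=\; \operatorname{ord}_{s=1} L(E, s)
```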

Hodge Conjecture


Problem: Determine whether certain cohomology classes on projective algebraic varieties are algebraic cycles.
Benchmark: synthetic tasks in cohomology, curvature, and geometric intuition designed to echo the flavor of Hodge-theoretic arguments.
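For orientation, the statement being echoed (standard form, not a benchmark task) is:

```latex
% For a smooth complex projective variety X, the Hodge classes in degree 2k are
\operatorname{Hdg}^{k}(X) = H^{2k}(X, \mathbb{Q}) \cap H^{k,k}(X)
% Hodge conjecture: every class in Hdg^k(X) is a Q-linear combination of
% classes of algebraic cycles of codimension k on X.
```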

Topological Surrogates


Problem: The (now resolved) Poincaré conjecture on three-dimensional manifolds, used here as a stand-in for deep open problems in topology.
Benchmark: toy 3-manifold and homotopy-style tasks that stress high-level geometric and topological reasoning.
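The resolved statement that motivates this track (standard form, for context only) is:

```latex
% Poincare conjecture (proved by Perelman): for a closed, connected 3-manifold M,
\pi_1(M) \cong 1 \;\Longrightarrow\; M \text{ is homeomorphic to } S^{3}
```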

Methodology: results indicate whether models consistently pass or fail synthetic task suites that mimic structural aspects of the Millennium Prize Problems (e.g. formal proof steps, conjecture search, counterexample discovery, and self-critique), evaluated via automated checkers and human expert review. No model here is claimed to have solved any genuine Millennium Prize Problem.
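A minimal sketch of how a strict pass/fail aggregation over automated checks and expert review could look is given below. The names (TaskResult, evaluate_track, evaluate_model) are hypothetical; the article does not publish the benchmark's actual evaluation harness.

```python
# Hypothetical sketch of strict pass/fail scoring across the seven tracks.
from dataclasses import dataclass

@dataclass
class TaskResult:
    task_id: str
    passed_auto_check: bool      # automated checker verdict (e.g. proof-step validation)
    passed_expert_review: bool   # human expert confirmation

def evaluate_track(results: list[TaskResult]) -> bool:
    """A track passes only if every task clears both the automated checker
    and expert review; there is no partial credit."""
    return all(r.passed_auto_check and r.passed_expert_review for r in results)

def evaluate_model(track_results: dict[str, list[TaskResult]]) -> dict[str, bool]:
    """Map each Millennium-inspired track to a single pass/fail verdict."""
    return {track: evaluate_track(results) for track, results in track_results.items()}
```

The all-or-nothing aggregation mirrors the "no partial credit" rule stated above: a single failed task fails the whole track.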

MillenniumPrizeProblemBench is designed to track how quickly frontier models close the gap between today’s capabilities and the level of reasoning suggested by the Millennium Prize Problems.

Future model performance

While current models fail every MillenniumPrizeProblemBench track, recent benchmark history suggests that performance can improve rapidly once a capability becomes an optimization target. It is plausible that future systems will eventually achieve non-trivial pass rates on synthetic tasks that mirror aspects of the Millennium Problems. Passing multiple tracks would indicate strong performance on closed-ended, verifiable mathematical reasoning, but it would not by itself imply autonomous research capabilities or “artificial general intelligence”. MillenniumPrizeProblemBench focuses on structured proof-style problems rather than open-ended research or creative discovery, making it a targeted measure of technical reasoning under strict verification.

Impact

By providing a clear, pass-or-fail view of progress on Millennium-inspired tasks, MillenniumPrizeProblemBench offers a common reference point for researchers, labs, and policymakers when assessing model capabilities. This can support more grounded discussions about development trajectories, potential risks, and appropriate governance measures. Even if no model comes close to resolving the true Millennium Problems, tracking performance on structurally similar benchmarks helps clarify where today’s systems excel, where they still break, and which kinds of mathematical reasoning remain firmly out of reach.