First Proof | Research-Level Math for AI Evaluation


A set of ten math questions to evaluate the capabilities of AI systems to autonomously solve problems that arise naturally in the research process.

About the Project

In baking, the first proof, or bulk fermentation process, is a crucial step in which one lets the entire batch of dough ferment as one mass, before dividing and shaping it into loaves.

This project represents our preliminary efforts to develop an objective and realistic methodology for assessing the capabilities of AI systems to autonomously solve research-level math questions. After letting these ideas ferment in the community, we hope to produce a more structured benchmark.

We present a diverse set of 10 research-level math questions, drawn from algebraic combinatorics, spectral graph theory, algebraic topology, stochastic analysis, symplectic geometry, representation theory, lattices in Lie groups, tensor analysis, and numerical linear algebra. Each question arose naturally in the research process of the authors and has been answered with a proof of roughly five pages or fewer, but the answers have not yet been posted online.

For the next batch, we will implement a benchmarking phase prior to the community release. The benchmark phase will be designed to ensure the following features:

  • Verification that the solutions are produced autonomously.
  • A formal grading scheme and refereeing, modeled on the journal review system.
  • An explicit description of the problem selection process, including advance internal testing on systems which have a zero data retention policy.

If you are interested in an assessment of your solutions to the next round of questions, please email contact@1stproof.org.

We will provide details about the design of the next round on March 14, 2026. After the formal phase of the second batch of problems, we will include another informal community experimentation phase to generate further discussion. We hope to use the feedback we receive from the community to inform the design of this phase.

Questions & Resources

Read the Paper

Our methodology, the complete set of questions, and discussion of related work.

View Paper on arXiv

LaTeX Source

The LaTeX source of the paper, including the problem statements.

View LaTeX Source

Solutions

Solutions were released at 11:59pm Pacific Time on February 13, 2026. These include the author solutions, a link to the original encrypted solutions together with the key to unlock them, and the AI solutions produced by the project team.

View Solution Files

We invite the community to experiment with our ten questions and to share their results and observations online. Ideally, participants should share a complete transcript of their interaction with an AI system. The most credible solutions will be those that were completed before the solutions were officially released.

We are thrilled about the excitement this project has generated, and we are grateful to the community for engaging with us. ICARM has generously agreed to host a web-public Zulip channel for discussion of the solutions. Some questions to seed the discussion:

  • How do various prompting strategies compare for each question?
  • Are there harnessing or scaffolding strategies that succeed in improving model outputs?
  • Does the success of such methods depend on the mathematical area?
  • How do we define an autonomously produced solution, and how do we guarantee it?
  • How should solutions be graded?

We encourage participants to share these questions and their findings on social media using the hashtag #1stProof.

Note on solutions: we consider an AI model to have answered one of our questions if it can autonomously produce a proof that conforms to the levels of rigor and scholarship prevailing in the mathematics literature. In particular, the AI should not rely on human input for any mathematical idea or content, or to help it isolate the core of the problem. Citations should include precise statement numbers and should be either to articles published in peer-reviewed journals or to arXiv preprints.

Get Involved

If you are a mathematician interested in contributing future problem sets to 1st Proof, please reach out to us at contact@1stproof.org.

Euler Day — February 7, 2026

On Euler Day, we celebrate Leonhard Euler's 1768 method for approximating solutions to differential equations, the same fundamental approach that underlies gradient descent in modern machine learning.
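The link between the two can be made concrete with a minimal sketch. It is an illustration only, not part of the project materials, and every function name and the toy loss L(x) = (x - 3)^2 are hypothetical choices made for this example. Forward Euler advances dy/dt = f(t, y) by y_{n+1} = y_n + h f(t_n, y_n). Applying that update to the gradient flow dx/dt = -L'(x) with step size h equal to the learning rate reproduces the gradient descent iteration.

```python
def euler_solve(f, y0, t0, h, n_steps):
    """Forward Euler for dy/dt = f(t, y): y_{n+1} = y_n + h * f(t_n, y_n)."""
    t, y = t0, y0
    for _ in range(n_steps):
        y = y + h * f(t, y)
        t += h
    return y


def gradient_descent(grad, x0, lr, n_steps):
    """Plain gradient descent: x_{k+1} = x_k - lr * grad(x_k)."""
    x = x0
    for _ in range(n_steps):
        x = x - lr * grad(x)
    return x


def loss_grad(x):
    """Gradient of the toy loss L(x) = (x - 3)^2 (hypothetical example)."""
    return 2.0 * (x - 3.0)


def gradient_flow(t, x):
    """Right-hand side of the gradient flow dx/dt = -L'(x)."""
    return -loss_grad(x)


# Same start, same step size, same number of steps.
x_gd = gradient_descent(loss_grad, x0=0.0, lr=0.1, n_steps=50)
x_euler = euler_solve(gradient_flow, y0=0.0, t0=0.0, h=0.1, n_steps=50)

print(x_gd, x_euler)  # both iterates approach the minimizer x = 3
```

In this sketch the two routines trace the same sequence of iterates, because the learning rate plays the role of Euler's time step.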

Team for February 2026 Release

Mohammed Abouzaid
Stanford University

Andrew J. Blumberg
Columbia University

Martin Hairer
EPFL and Imperial

Joe Kileel
University of Texas at Austin

Tamara G. Kolda
MathSci.ai

Paul D. Nelson
Aarhus University

Daniel Spielman
Yale University

Nikhil Srivastava
University of California, Berkeley

Rachel Ward
University of Texas at Austin

Shmuel Weinberger
University of Chicago

Lauren Williams
Harvard University