Formal Reasoning Meets LLMs: Toward AI for Mathematics and Verification


Formal reasoning encodes knowledge using formal languages, deriving conclusions through symbol manipulation using well-defined inference rules. Its theoretical foundations were established in the early days of computer science by pioneers such as Gödel and Turing.8,34 Today, formal reasoning underpins many core areas of computer science, including knowledge representation in databases and the formal verification of software, hardware, and cyber-physical systems.a

Early AI research was also deeply rooted in formal reasoning. The first AI program in history—Newell and Simon’s Logic Theorist24—was designed to automate mathematical reasoningb using formal logic. Over the past few decades, the center of AI has shifted from symbolic methods to machine learning, with large language models (LLMs)—massive neural networks trained on Internet-scale data—now dominating the field. While LLMs were originally motivated by problems in natural language processing, they have recently shown intriguing promise in mathematical reasoning. Notably, models developed by OpenAI and Google DeepMind reportedly reached gold medal performance in the 2025 International Mathematical Olympiad (IMO).c

The mathematical reasoning performed by LLMs is fundamentally different from the rule-based symbolic methods in traditional formal reasoning. Figure 1 illustrates a common approach for LLM-based mathematical problem solving. The process begins with training LLMs on mathematical data. This is typically done by further pretraining an existing LLM (already pretrained on Internet-scale text data) on mathematical webpages such as arXiv papers and MathOverflow discussions. The model is then finetuned on another curated dataset, consisting of math problems paired with detailed step-by-step solutions that may include calls to external tools like Python. Once the training completes, the model can be used during inference by predicting the next word autoregressively. This “informal” approach offers unique advantages over traditional formal reasoning based purely on symbolic methods. First, traditional formal reasoning requires manually encoding problems and world knowledge into symbolic rules—a challenge that has been attempted for decades but has proven to be extremely difficult, if not impossible. In contrast, LLMs can learn world knowledge during training and can reason without explicit formalization. Second, reasoning and proof search in formal systems have traditionally relied on handcrafted heuristics, which tend to be brittle and struggle to generalize across domains. LLMs, however, can learn reasoning patterns directly from data, making them more flexible and adaptive. This raises the question: Given the effectiveness of informal reasoning with LLMs, do we still need formal reasoning at all?
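The inference loop described above can be sketched in a few lines. This is a toy illustration only: `model`, `tokenize`, `detokenize`, and `end_token` are hypothetical stand-ins for a real LLM, its tokenizer, and its special end-of-response token, not any particular library's API.

```python
def generate(model, tokenize, detokenize, prompt, end_token, max_steps=256):
    """Greedy autoregressive decoding: repeatedly append the model's
    next-token prediction until the end-of-response token appears."""
    tokens = tokenize(prompt)
    for _ in range(max_steps):
        next_token = model(tokens)  # predict the most likely next token
        tokens.append(next_token)
        if next_token == end_token:
            break  # the special "end of response" token terminates generation
    return detokenize(tokens)
```

In practice the loop also handles sampling, batching, and tool use, for example pausing generation to run a Python snippet and feeding its output back in as tokens.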

Figure 1.  A common approach to the training and inference of math LLMs. During inference, the model repeatedly generates the next word in the solution until a special “end of response” token is generated. It can also outsource certain computations to external tools such as Python.

But despite their strengths, LLMs exhibit reasoning deficiencies that formal methods can help address. First, LLMs depend heavily on large, high-quality training datasets, making it challenging to apply them in data-scarce domains such as advanced research-level mathematics or the verification of novel software systems. Second, LLMs offer no guarantees of sound reasoning and are prone to “hallucinating” false or nonsensical outputs. As Buzzard argues in a recent blog post analyzing the performance of OpenAI’s o1 model in the Putnam mathematics competition,4 LLMs often do well at guessing final answers to math problems but fail at rigorously proving the validity of these answers. In contrast, formal systems allow for rigorous step-by-step proof and the ability to attribute incorrect conclusions to inconsistent axioms or flawed inferences.

Given the complementary strengths of formal methods and LLMs, a natural approach to building future reasoning systems is to integrate them. That is, we could imagine using an LLM to generate proofs in a formal language and using a formal proof system to check these proofs. Feedback from formal systems could be used to rule out hallucinations. Moreover, formal proof systems could be used to generate unbounded amounts of synthetic data for training LLMs.

Two successful examples of this approach are AlphaProof9 and AlphaGeometry.33 These systems build on a rich body of research on integrating formal methods with machine learning in mathematical tasks, including neural theorem proving and translating from informal language to formal language (autoformalization). The advent of LLMs has significantly accelerated research in this area. LLMs now can predict proof steps,32 fix buggy proofs,7 and perform autoformalization without explicitly training on aligned informal-formal data.36 Meanwhile, the research ecosystem surrounding LLMs and formal reasoning continues to expand. Lean,5 a language for writing formal proofs, has attracted both mathematicians and AI researchers. LLMs have shown promise in assisting humans in writing Lean proofs,29 potentially initiating a data flywheel where growing human-written formal math data leads to more capable LLMs, which in turn eases the creation of more data.

Recent advances have sparked a surge of interest in applying AI to formal mathematical reasoning, with the number of publications in this field nearly doubling annually in both 2023 and 2024.17 In 2024, AlphaProof leveraged formal reasoning to become the first AI to achieve the silver-medal level in the IMO, followed by strong results from ByteDance and Harmonic in IMO 2025. Beyond mathematics, developments in this field have immediate applications in the formal verification of safety-critical systems.14 Historically, formal verification has been prohibitively expensive for all but the most high-stakes applications. AI can drastically reduce this cost by automating the formalization and proof effort needed to formally certify complex systems, potentially making mass-produced software and hardware systems far more robust than they are today.

We believe AI-based formal reasoning has reached an inflection point, with significant progress coming in the near future. In what follows, we introduce this field, outline open challenges in data and algorithms, and discuss promising directions for future research.

AI for Formal Mathematical Reasoning

Here, we introduce formal systems, with Lean5 as a representative example. We then discuss how AI-based formal reasoning can benefit mathematics, general reasoning, and applications, such as the design and verification of hardware and software systems.

Formal systems and Lean.  Formal systems provide computational models for manipulating formulas and deriving theorems based on symbolic inference rules. Examples include Peano arithmetic, higher-order logic,25 and dependent type theory.2 The consistency of these systems allows us to potentially ground AI reasoning within them, enabling the generation of feedback signals that are useful for training and evaluating AI models. Unlike human annotations, these signals can be fully automated, making them more cost-effective and dependable.

An important implementation of formal systems is proof assistants—software enabling humans to write formal proofs about mathematics or verified software. Common examples of proof assistants include Coq,2 Isabelle,25 and Lean. They have different logical foundations but similar user interfaces, so we use Lean as a representative example. Figure 2 shows how Lean is used to formalize mathematics. At its core, it is a programming language for writing not only conventional programs but also mathematical definitions, theorems, and proofs. Figure 2 (middle) shows a Lean file. After defining natural numbers (Nat) and addition (add), it states and proves the theorem add_zero (∀ n ∈ ℕ, 0 + n = n). Lean can automatically check whether the proof is correct, that is, consistent with the theorem statement.

Figure 2.  Formalizing mathematics using the Lean proof assistant.5
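For readers without Lean experience, a file of the kind the figure describes might look roughly like the following sketch (not the figure's exact code; `Nat` is renamed `MyNat` here to avoid clashing with Lean's built-in natural numbers):

```lean
-- Natural numbers, defined from scratch.
inductive MyNat where
  | zero : MyNat
  | succ : MyNat → MyNat

-- Addition, by recursion on the second argument.
def add : MyNat → MyNat → MyNat
  | n, .zero   => n
  | n, .succ m => .succ (add n m)

-- add_zero: ∀ n, 0 + n = n. Because `add` recurses on its
-- second argument, this statement requires induction on n.
theorem add_zero : ∀ n : MyNat, add .zero n = n := by
  intro n
  induction n with
  | zero => rfl
  | succ m ih => simp [add, ih]
```

Lean checks the proof mechanically: if any step were wrong, the file would fail to compile.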

Theorem proving in Lean is an interactive process (Figure 2, left). It begins with the statement as the initial goal, and the user enters a proof step, known as a tactic. Lean executes the tactic, transforming the goal into a list of subgoals. The user then inspects the new goals and enters new tactics, repeating this process until there are no goals left. This process implicitly defines a proof tree whose nodes are goals and edges are tactics. The user plays a key role here. Though proof assistants like Lean were designed with human users in mind, in formal mathematical reasoning, the user can also be AI or humans in collaboration with AI.
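As a minimal, self-contained illustration of this loop (independent of the figure), consider proving a conjunction: each tactic transforms the current goal, and the proof is complete when no goals remain.

```lean
example (p q : Prop) (hp : p) (hq : q) : p ∧ q := by
  constructor  -- splits the goal ⊢ p ∧ q into two subgoals: ⊢ p and ⊢ q
  · exact hp   -- closes the first subgoal
  · exact hq   -- closes the second subgoal; no goals remain
```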

Formalizing mathematics using Lean is like developing software (Figure 2, right). Files are organized into larger code units such as libraries and projects, which can be open sourced on GitHub and reused by other projects. For example, the formalization of cutting-edge research often builds upon the basic concepts formalized in mathlib,22 Lean’s general-purpose mathematical library.

AI meets formal mathematics.  Integrating AI with proof assistants like Lean offers mutual benefits. Proof assistants provide data and environments for AI to learn math, while AI can enhance their usability by automating simple proofs. Figure 3 outlines key tasks in this area. One of these tasks, mentioned earlier, is autoformalization, which translates informal mathematics from textbooks or papers into formal theorems and proofs. A common approach uses LLMs to learn from a handful of expert-constructed informal-formal pairs as demonstrations.36 Another task is theorem proving, where AI generates formal proofs given theorem statements. Since the task requires complex reasoning and allows rigorous evaluation in Lean, it has become increasingly popular as a testbed for LLMs.32,38

Figure 3.  Tasks using AI for formal mathematical reasoning in proof assistants.

While formal reasoning offers clear benefits, it need not replace informal reasoning; rather, the two can complement each other to support complex, rigorous reasoning in a wide range of domains. For example, combining autoformalization with theorem proving can be useful for solving problems formulated in natural language.40 We refer to the combination of formal and informal reasoning as verified reasoning in natural language.

Moreover, formal mathematics can be applied to the verification of software and hardware systems.14 Here, one specifies the correctness/security requirements as formal statements and uses theorem proving to establish that the system satisfies its requirements. AI-driven autoformalization and theorem proving can potentially facilitate this process, significantly reducing its costs. The integration of AI and verification has been explored in the formal methods community; for example, Seshia28 proposed SID (structure, induction, and deduction), a framework that combines inductive and deductive reasoning. While this paper focuses primarily on these two modes of reasoning, other forms—such as counterfactual and analogical reasoning16—have also been studied in the context of mathematics and verification.

Recent Advances and Ongoing Challenges

Formal mathematical reasoning presents a wealth of challenging problems for AI. Here, we identify several open challenges and promising directions, urging the community to work together to advance progress in this field.

Data.  The simultaneous scaling of compute and training data has been the main driver of LLMs’ performance in various tasks. However, scaling the training data in formal mathematics is hampered by the scarcity of human-created formal proofs. The Proof Pile dataset,1 which aggregated proofs from six different formal languages, collected only 500MB of formal proofs—orders of magnitude smaller than its informal counterparts. The issue is more pronounced in research mathematics, where even informal data is limited.

Researchers are exploring different strategies to overcome data scarcity. The first is autoformalization. We have a substantial amount of informal math data. Since formal proofs can be verified easily, if a system successfully formalizes even a small subset of the informal math data, it can iteratively self-improve by learning from the additional formal data it has created, potentially covering an increasingly larger set with each iteration. Another approach is synthetic data generation. For example, AlphaGeometry’s training data consisted solely of synthetic geometry problems and solutions.33 Mathematical axioms entail all provable facts, in principle containing infinite data. By generating synthetic data, AI can potentially explore and learn from the vast space of possible mathematics, at a scale that can drastically surpass the pace of human-created training data. If the method can be generalized, it would help in completely new mathematical domains, where even informal data might be scarce.

Autoformalization and synthetic data were combined in AlphaProof,9 which autoformalized one million informal math competition problems into 100 million formal theorems, whose proofs were synthetically generated using expert iteration. It remains an open question how to generalize this approach to domains without a large supply of human-written problems, such as research mathematics. Those domains will likely require conjecturing new unseen statements.26
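The expert-iteration loop behind this kind of self-improvement can be sketched as follows. All interfaces here (`attempt`, `verify`, `train`) are hypothetical placeholders, not AlphaProof's actual implementation; the key point is that only proofs the proof assistant verifies enter the training set.

```python
def expert_iteration(prover, train, attempt, verify, theorems, rounds=2):
    """Each round: attempt proofs, keep only verified successes,
    and retrain the prover on the verified (theorem, proof) pairs."""
    dataset = []
    for _ in range(rounds):
        for thm in theorems:
            proof = attempt(prover, thm)  # may fail and return None
            if proof is not None and verify(thm, proof):
                if (thm, proof) not in dataset:
                    dataset.append((thm, proof))  # verified, safe to train on
        prover = train(prover, dataset)  # self-improvement step
    return prover, dataset
```

Because verification is automatic, each round can expand the training set without human labeling, which is what makes the approach scale.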

Another promising strategy is knowledge transfer from different modalities. Specifically, code is closely related to mathematics, as both require symbolic reasoning. This similarity has been exploited to improve AI’s general mathematical capabilities, though it is still an open question how to leverage data-rich programming languages such as Python to enhance reasoning in formal mathematical languages.

Algorithms

Autoformalization at scale.  Since any given theorem can be formalized in multiple equivalent ways, a major bottleneck in autoformalization is evaluating whether the autoformalized statement is logically equivalent to the ground truth. Proxy metrics such as BLEU do not correlate well with human judgment,23 but relying on humans is not scalable. Possible ideas for better automated metrics include checking logical equivalence via automated provers.18,23
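One concrete form of this idea: if two candidate formalizations can be proved equivalent inside the proof assistant, the autoformalization is certified up to logical equivalence. A toy Lean sketch, with the `omega` decision procedure standing in for an automated prover (the definitions `evenA` and `evenB` are hypothetical):

```lean
-- Two different formalizations of "n is even".
def evenA (n : Nat) : Prop := ∃ k, n = 2 * k
def evenB (n : Nat) : Prop := n % 2 = 0

-- Proving the equivalence certifies that the two statements agree.
example (n : Nat) : evenA n ↔ evenB n := by
  unfold evenA evenB
  constructor
  · intro ⟨k, hk⟩
    omega
  · intro h
    exact ⟨n / 2, by omega⟩
```

Equivalence checking is undecidable in general, so such automated checks can certify some translations but not refute all mismatches.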

Autoformalization extends beyond mere translation; formalizing certain problems—such as the combinatorics problem (P5) at IMO 2024d—may necessitate complex reasoning, retrieving relevant definitions, or even formulating new ones. For such problems, it is natural to break down the reasoning process into smaller steps, for example, retrieving definitions before formalizing the statement or generating high-level sketches before formalizing the proof. We anticipate benefits from smaller steps and process supervision19,21 and call for autoformalization to be more interactive.

Proof search and test-time compute.  Search is fundamental in many reasoning systems. Neural theorem provers often combine tactic generation with proof search. Their search strategy ranges from independent sampling of multiple candidates to sophisticated algorithms like Monte Carlo Tree Search.15 Scaling search to exploit vast test-time compute has emerged as a promising approach for both formal and informal reasoning.
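To make the search component concrete, here is a best-first proof search sketch. A state is the tuple of open goals; `expand` (a stand-in for a neural tactic generator) proposes (tactic, subgoals) candidates, and `score` (a stand-in for a value model) ranks states. All interfaces are hypothetical.

```python
import heapq

def best_first_search(initial_goals, expand, score, max_nodes=1000):
    """Best-first proof search: expand the most promising state first.
    A proof is found when a state has no open goals left."""
    start = tuple(initial_goals)
    frontier = [(-score(start), start, [])]  # max-heap via negated scores
    seen = {start}
    for _ in range(max_nodes):
        if not frontier:
            break
        _, goals, tactics = heapq.heappop(frontier)
        if not goals:
            return tactics  # all goals closed: the tactic list is a proof
        first, rest = goals[0], goals[1:]
        for tactic, subgoals in expand(first):
            new_state = tuple(subgoals) + rest
            if new_state not in seen:
                seen.add(new_state)
                heapq.heappush(frontier,
                               (-score(new_state), new_state, tactics + [tactic]))
    return None  # budget exhausted without a proof
```

Independent sampling corresponds to a degenerate `score`; Monte Carlo Tree Search replaces the priority queue with bandit-style exploration statistics.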

Many questions and trade-offs surrounding proof search remain unexplored. Is proof search indispensable, given that generating complete proofs can achieve lower latency?7 For a fixed compute budget, should we use smaller models with more search steps or larger models with fewer steps?3 How do different search algorithms compare? To answer these questions and guide the development of future provers, we need a systematic evaluation of existing methods, which is currently lacking due to the inherent challenge of fairly evaluating theorem proving in a unified way. It is unclear how to compare provers targeting different proof assistants. Even within the same proof assistant, a prover’s performance is multifaceted and depends on resource constraints (e.g., hardware and time limits), making it difficult to consolidate performance into a single metric. A comprehensive evaluation that addresses these complexities would be immensely valuable. Despite these challenges, researchers are exploring various directions to improve proof search, such as developing value models to assess the promise of proof goals.15

Proof search alone does not solve theorem proving, where a fundamental challenge is a discrete, infinite action space. Proof search cannot succeed if the model cannot produce high-quality actions in the first place. In the context of theorem proving, mathematical creativity can manifest as actions exceeding the current model’s capabilities, akin to the “divine move”—a legendary concept in Go. We would not expect to find them if the action space were an infinite, unstructured list. Fortunately, mathematics is structured, making it possible—though still challenging—to find the divine moves. Next, we discuss several ways of leveraging structures in mathematics.

Exploiting hierarchies and learning abstractions.  Theorems build upon smaller lemmas, which in turn break down into even simpler subgoals. Several existing theorem provers leverage this hierarchical structure. Draft, Sketch, and Prove11 transformed informal proofs into formal “proof sketches”—skeletons of formal proofs with “holes,” that is, open goals left unproven, yielding a hierarchical structure. POETRY35 recursively decomposed goals in proof sketches using an LLM. While these works demonstrated the potential of hierarchical decomposition, it is still a significant challenge to decompose realistic high-level goals with current LLMs.
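In Lean, such a proof sketch can be written with `sorry` marking the holes: the skeleton fixes the high-level structure, and each `sorry` is an open subgoal left for an automated prover to fill. A toy example (not drawn from the cited systems):

```lean
example (a b : Nat) (h : a ≤ b) : a ≤ b + 1 := by
  -- Skeleton: reduce the goal to one auxiliary fact, left as a hole.
  have hb : b ≤ b + 1 := by
    sorry  -- hole: a simple arithmetic subgoal for the prover to close
  exact Nat.le_trans h hb
```

Lean accepts such a file with warnings, so the skeleton can be checked for well-formedness before the holes are proved.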

Abstraction is central to human mathematical practice. We first learn natural numbers through counting; years later, those operations show up in solving equations but do not require as much attention anymore. In interactive theorem proving, abstractions can be encapsulated in new definitions, lemmas, and tactics. While most existing methods assume they are predefined and fixed, recent work has explored learning abstractions. For instance, LEGO-Prover37 used LLMs to propose and prove new lemmas, integrating them into its library to help prove further theorems. Lemma mining from existing proof corpora has also been explored.12 These lemmas, not explicitly factored out by humans, are still useful for automation. On learning tactics, Peano27 has proposed to learn simple proof strategies from an agent’s own solutions to past problems, in a bootstrapping fashion. However, these approaches have so far been limited to simpler formal systems, and it is still an open challenge to synthesize entirely new tactics in full-fledged formal theorem-proving languages like Lean.

Incorporating external knowledge.  Formal mathematical reasoning can benefit from explicitly retrieving and incorporating knowledge from databases of existing lemmas/definitions. ReProver38 and COPRA32 demonstrated performance gains through standard retrievers such as Dense Passage Retrieval.13 A promising direction involves developing retrievers tailored to formal math, for example, structured or neurosymbolic retrievers. Another avenue is to grow the knowledge base dynamically. For example, the system could decompose high-level proof goals into subgoals, cache a subset of these subgoals as modules, and use them in subsequent proof efforts. Deciding which subgoals are “interesting” enough to be modularized in this way is a challenge.
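A dense retriever of the kind these systems use can be sketched in a few lines: embed the goal and every candidate lemma, then return the lemmas closest in cosine similarity. The `embed` function here is a hypothetical placeholder for a learned encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(embed, goal, lemmas, k=2):
    """Rank lemmas by embedding similarity to the goal; return the top k."""
    goal_vec = embed(goal)
    ranked = sorted(lemmas, key=lambda lem: cosine(embed(lem), goal_vec),
                    reverse=True)
    return ranked[:k]
```

The retrieved lemmas are then placed in the prover's context, conditioning tactic generation on relevant prior knowledge.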

Formal verification and verifiable generation.  As in the field of AI4Math, we envision a growing need for formal reasoning in AI-based software and hardware generation, with assurance of correctness and security. While syntactic correctness can be guaranteed by constrained decoding, ensuring semantic properties, such as those validated by compilers, remains a challenge. Moreover, formal reasoning can help programmers understand AI-generated code.6

Formal verification poses unique challenges. For example, a necessary but challenging step is encoding the target system and the correctness requirements in the proof assistant, similar to formalizing mathematics. However, while mathematical statements tend to assert properties of established mathematical objects, statements in formal verification typically concern bespoke procedures and datatypes. Also, proofs tend to be more repetitive and heavy on case-splits and inductive reasoning about recursive functions and datatypes. Finally, unlike statements in mathematics research, real-world software and hardware systems are characterized by large codebases and frequent changes. For instance, seL414 consists of about 200,000 lines of specifications and proofs in Isabelle. Verifying these systems requires not only theorem proving but also rigorous management of specifications and proofs—an exciting yet underexplored direction for AI.

It is natural to couple formal verification and AI-based generation to simultaneously generate code, formal specifications (i.e., pre/post-conditions, loop invariants, and helper assertions), and proofs. Then a program verifier or a theorem prover can check if the code is consistent with the specifications and proofs. This approach has been explored in recent research30 and can potentially reduce verification efforts and enhance software and hardware reliability. However, a key challenge is to ensure the trustworthiness of the generated specifications—that they accurately reflect developers’ intent.
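In a proof-assistant setting, generating code, specification, and proof together can look like the following Lean sketch: a small program, a formal postcondition, and a machine-checked proof, with `omega` again standing in for the prover (toy example, not from the cited work).

```lean
-- Code: a candidate implementation of maximum on natural numbers.
def maxNat (a b : Nat) : Nat := if a ≤ b then b else a

-- Specification and proof: the result is an upper bound of both inputs.
-- The checker accepts the code only together with a proof of the spec.
theorem maxNat_spec (a b : Nat) : a ≤ maxNat a b ∧ b ≤ maxNat a b := by
  unfold maxNat
  split <;> omega  -- case-split on the `if`, then close both goals
```

Note that the proof certifies only the stated specification; whether the specification itself captures the developer's intent is exactly the trustworthiness challenge above.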

Milestones and Success Measures

We envision key milestones for measuring success in areas like autoformalization and theorem proving. Achieving them effectively requires collaborative research and open source efforts that provide datasets, AI systems, and shared infrastructure. By fostering a community-driven approach, we can accelerate progress in AI for formal mathematical reasoning and broaden its impact.

Autoformalization.  LLMs can readily generate autoformalization candidates that are usable after human review and refinement. To bootstrap autoformalization from this process, we need a system for collecting and storing human feedback, creating a continuously evolving dataset of synchronized informal and formal knowledge. This dataset will support the midterm goal of robust and faithful translations between informal and formal statements. In the long term, AI-based autoformalization should go beyond direct translation. For example, it should handle implicit assumptions and hand-waved proofs that pose challenges in formalizing mathematics. The model should be capable of inferring missing information and identifying cases where gaps cannot be resolved, necessitating strong reasoning capabilities. Additionally, it should detect and correct errors or inconsistencies by engaging with humans to clarify intent and refine its understanding.

Theorem proving.  Current machine learning methods can assist humans in formal proof development by suggesting useful definitions, lemmas, and proof steps.29 They can also autonomously prove simple theorems and fill small gaps in larger proofs in a domain-general manner.38 A natural next step is advancing toward more complex tasks, such as autonomously contributing to large-scale formalization projects. This requires AI to move beyond individual proofs to decompose larger results, propose new definitions and lemmas, and explore alternatives as the project develops. Evaluating such capabilities may require new benchmarks derived from real-world formalization efforts, incorporating GitHub metadata such as issues and commits. In the long term, we can envision AI systems that not only solve mathematical problems but also discover new mathematics beyond human capabilities. A key challenge will be developing meaningful metrics to track progress toward this open-ended goal, as current evaluations primarily measure proficiency in existing mathematics.

Verified reasoning in natural language.  Existing approaches to incorporate verification into natural language reasoning either rely on neural verifiers, which can be brittle and lack formal guarantees, or attempt to fully formalize the problem,40 limiting their applicability. Future models should reason seamlessly across natural language and formal systems such as Lean. By interleaving natural language with formal reasoning, they should selectively determine which parts of reasoning to process using formal systems. Such an approach could enable complex mathematical reasoning and planning in real-world applications that integrate mathematical components with elements that are challenging to formalize, such as common sense and human preferences.

Formal verification and verifiable generation.  Current LLMs still struggle with verifying simple properties of small programs,20 let alone synthesizing formally verified code—a more nascent task for which we do not have a large-scale, high-quality benchmark to evaluate AI models. Therefore, developing benchmarks and methods for small programs is an important goal in the short to mid-term.31,39 In the long term, AI should be able to verify and synthesize entire projects with complex functional and security properties. This capability requires decomposing large systems into smaller verifiable components, a task currently performed by humans but one that may be tackled by advanced AI agents capable of planning and problem solving to navigate the intricate dependencies and interactions in large codebases. Additionally, future AI should help users generate, explain, and debug formal specifications interactively.

Conclusion

In this article, we advocated for formal mathematical reasoning as an important complement to the informal approach, highlighting its potential to advance AI in mathematics and verifiable system design. We hope we have presented coherent perspectives that unite previously fragmented efforts across different fields, fostering discussion, community building, and a future roadmap.

Acknowledgments

We gratefully acknowledge Jeremy Avigad, Albert Q. Jiang, Zhaoyu Li, Peter O’Hearn, Daniel Selsam, Sanjit A. Seshia, Armando Solar-Lezama, and Terence Tao for providing valuable feedback on an initial version of this article.