Math Agents: Mathematical Reasoning and Formal Proofs in LLMs

Mathematics has long been regarded as a root node of intelligence. It demands sustained reasoning and planning, deep generalization and abstraction, and offers open-ended, unbounded complexity. In recent years, an “arms race” in AI math and coding capability has erupted, led by models like OpenAI’s o3/o4-mini, Google’s Gemini Deep Think, DeepSeekMath-V2, and ByteDance’s Seed Prover. But why is everyone so intensely focused on math?

Because mathematics is the ultimate proxy for complex reasoning and planning, and, crucially, it is relatively easy to evaluate (math has objective answers; code has unit tests). If we can enable AI to master mathematics autonomously, that capability should generalize to countless complex real-world tasks, such as advanced travel planning, scheduling, and scientific discovery.

This post provides a detailed overview of the mainstream paradigms for training LLMs to solve math problems, explores their current bottlenecks, and deeply dissects the underlying logic and technical details of next-generation Math Agents based on “formal verification” (such as AlphaProof).


1. How LLMs Learn Math: The Current Mainstream Recipe

Today, state-of-the-art math LLMs generally follow a standard paradigm: a strong pre-trained base model + two post-training techniques + extreme engineering.

1.1 Supervised Fine-Tuning (SFT): Good Data is All You Need

To teach a model to reason step-by-step, you first need high-quality training data. Researchers curate massive datasets consisting of “problems + step-by-step/tool-integrated solutions” generated jointly by humans and powerful LLMs. For instance, NuminaMath-1.5, one of the largest public math datasets, contains nearly 900K high-quality instances. But what if the available data only has final answers without intermediate steps? This leads to the second technique.
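To make the "problem + tool-integrated solution" data format concrete, here is a hypothetical training instance (the field names are illustrative, not the actual schema of NuminaMath or any specific dataset):

```python
# A hypothetical SFT training instance in the "problem + tool-integrated
# solution" style. Field names are illustrative only.
instance = {
    "problem": "What is the sum of the first 100 positive integers?",
    "solution": [
        {"type": "reasoning", "text": "Use the formula n(n+1)/2 with n = 100."},
        {"type": "code", "text": "print(100 * 101 // 2)"},
        {"type": "observation", "text": "5050"},
        {"type": "answer", "text": "5050"},
    ],
}

# During SFT the model is trained to emit the whole `solution` trajectory,
# interleaving reasoning text with tool (code) calls and their observations.
assert instance["solution"][-1]["text"] == "5050"
```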

1.2 Reinforcement Learning (RL): Verifiability is All You Need

Once a model acquires basic reasoning skills, Reinforcement Learning (RL) steps in. By comparing the model’s generated final answer with the Ground Truth, we can provide the model with an absolutely objective feedback signal (Reward). Models like DeepSeek-R1 thrive on this exact mechanism: using rule-based automatic verification combined with algorithms like GRPO. By maximizing rewards through continuous trial and error, these models begin to exhibit self-reflection, “Aha moments,” and incredibly long chains of reasoning.


2. The Two Major Gaps in Current Math LLMs

Despite spectacular performance on benchmarks with clear final answers, such as GSM8K, MATH, and AIME, and even medal-level results at the IMO (International Mathematical Olympiad), LLMs still face two significant gaps:

2.1 From Pre-college to Advanced Math

Current successes are predominantly concentrated in pre-college math (like AIME and IMO) because these domains are data-rich. In the realm of research-level advanced mathematics where training data is scarce, models struggle immensely.

2.2 From “Guessing Answers” to “Writing Strict Proofs”

Existing models rely heavily on data “verifiability.” If a problem requires calculating a specific number, the system can easily score the answer automatically (making GRPO highly effective). However, when it comes to theorem proofs in advanced mathematics, models expose their weaknesses. They might generate proofs that look plausible but are logically flawed (bluffing). Because we cannot use simple scripts to automatically verify the correctness of a lengthy natural language proof, applying RL becomes virtually impossible.


3. The Breakthrough: Formal Mathematics and the Lean Language

To shatter the deadlocks of data scarcity and un-verifiability, researchers have pivoted to Formal Mathematics.

3.1 Informal vs. Formal Math

  • Informal Math: Written in natural language (text, images, etc.). It is highly flexible and widely used, but it is exceedingly difficult for humans or machines to automatically check for logical flaws (Reasoning Gaps).
  • Formal Math: Transforms math into source code (using interactive theorem provers like Lean). You write theorems (e.g., the statement that 1 + 1 = 2) and their proofs as code, and the compiler mechanically checks every step of the proof, not merely that the code runs. Code compiles successfully $\equiv$ The proof is completely correct.
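To make "math as source code" concrete, here is a minimal Lean 4 sketch. The first theorem is the 1 + 1 = 2 example above; the second uses the core lemma `Nat.add_comm`:

```lean
-- A theorem and its proof as code: if this file compiles, the proof is correct.
theorem one_add_one : 1 + 1 = 2 := rfl

-- A slightly less trivial statement, closed by the existing lemma Nat.add_comm.
example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```

If a single step were wrong, the compiler would reject the file; there is no such thing as a proof that "looks plausible but is flawed."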

In the open-source community, the Lean Mathlib project already boasts over 1 million lines of formal code, covering 100K definitions and 200K theorems. This environment, capable of providing “instant and absolutely correct feedback,” is the ultimate playground that RL has been dreaming of.


4. AlphaProof: A Milestone Where RL Meets Formal Math

Published in Nature (2025), AlphaProof is a masterpiece combining AI with formal systems, achieving Silver-medal standards on IMO-level problems.

AlphaProof inherits the “Tabula Rasa” (learning-from-scratch) philosophy of AlphaZero: if an agent can master an environment purely through trial and error, it can discover new knowledge on its own. Its master plan involves the following critical steps:

  1. Autoformalization: To acquire enough problems, a Formaliser model is first trained to automatically translate massive amounts of human-written natural language math problems into formal Lean code.
  2. Supervised Training on Mathlib (SFT): The Prover model undergoes supervised learning on Mathlib’s massive codebase, establishing a strong prior intuition for selecting proof steps (tactics).
  3. AlphaZero Reinforcement Learning: RL training is conducted on the formal problems. The model searches for proof steps within the Lean environment, with the Lean compiler acting as a flawless referee. Successful verification trajectories (Experience) are used to reinforce the Prover network.
  4. Test-Time RL: During inference, when faced with a specific new problem, the system generates multiple variants of the problem and continues the RL search process on the fly.

Beyond OpenAI and DeepMind, other institutions are advancing rapidly:

  • DeepSeek Prover Series: Evolving from V1 (large-scale synthetic-data SFT), to V1.5 (incorporating Lean compiler feedback into RL along with Monte-Carlo Tree Search, MCTS), and now V2 (combining RL with Subgoal Decomposition).
  • LeanDojo (ReProver): An open-source retrieval-augmented prover that retrieves relevant premises from an entire theorem library at every proof step to use as context.
  • Seed Prover 1.5: Designed as a true Agentic Architecture. Instead of relying on a model’s fragile implicit memory, it autonomously invokes a Mathlib search engine to find theorems, writes Python code to verify computational intuitions, and incrementally decomposes complex problems into smaller, independent Lemmas.
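LeanDojo's retrieval step can be illustrated with a deliberately simple sketch. The premise index below is hypothetical, and the bag-of-words cosine similarity is a stand-in for ReProver's learned dense retriever; only the overall shape (rank premises against the current goal, feed the top-k into the prover's context) matches the described approach:

```python
import math
from collections import Counter

# A toy premise index (hypothetical entries standing in for Mathlib lemmas).
PREMISES = {
    "Nat.add_comm": "addition of natural numbers is commutative a + b = b + a",
    "Nat.mul_comm": "multiplication of natural numbers is commutative a * b = b * a",
    "List.length_append": "length of appended lists is the sum of the lengths",
}

def bow(text: str) -> Counter:
    """Bag-of-words token counts (stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(goal: str, k: int = 2) -> list[str]:
    """Rank all premises against the current proof goal; the top-k are then
    concatenated into the prover's context at this proof step."""
    q = bow(goal)
    ranked = sorted(PREMISES, key=lambda n: cosine(q, bow(PREMISES[n])), reverse=True)
    return ranked[:k]

print(retrieve("goal: a + b = b + a for natural numbers"))
```

The design point is that retrieval sidesteps the model's fragile memory of a 200K-theorem library: the relevant lemma names arrive in-context at each step instead of having to be recalled from weights.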

5. Frontier Challenges and Explorations

Although formal proofs pave the way for RL, building Math Agents still faces monumental challenges.

5.1 Taming the Infinite Proof Search Space

The game of Go has a 19x19 board, but the action space for mathematical proofs is infinite. How do we search this tree efficiently?

  • Synergizing with Symbolic Reasoning (The LIPS System): Proposed at ICLR 2025, the LIPS system targets Olympiad inequality proofs. It categorizes proof steps into “Scaling” (e.g., substituting via Cauchy-Schwarz) and “Rewriting” (algebraic transformations). LIPS uses symbolic tools to enumerate and prune the Scaling operations; combined with the LLM’s rewriting capabilities, it discovers novel proof paths that had eluded human experts.
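One cheap pruning idea in this spirit is numerical falsification: before spending symbolic or LLM effort on a candidate scaling, sample random points and discard it the moment a counterexample appears. This is a generic illustration of pruning, not LIPS's actual algorithm:

```python
import random

def numerically_plausible(ineq, trials: int = 200,
                          lo: float = 0.01, hi: float = 10.0) -> bool:
    """Cheap falsification filter: sample random positive (a, b, c) and
    reject a candidate inequality as soon as one counterexample appears.
    Only survivors proceed to the expensive symbolic/LLM stage."""
    for _ in range(trials):
        a, b, c = (random.uniform(lo, hi) for _ in range(3))
        if not ineq(a, b, c):
            return False
    return True

# AM-GM for three variables -- genuinely true, so it should survive.
am_gm = lambda a, b, c: (a + b + c) / 3 >= (a * b * c) ** (1 / 3) - 1e-12
# A bogus candidate: fails whenever a*b*c grows past a + b + c.
bogus = lambda a, b, c: a + b + c >= a * b * c

random.seed(0)
print(numerically_plausible(am_gm))   # True
print(numerically_plausible(bogus))   # False: a counterexample is found
```

Passing the filter proves nothing, of course; it only keeps the downstream proof search from wasting its budget on steps that are false on their face.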

5.2 The Evaluation Dilemma in Autoformalization

Translating informal (natural language) theorems and proofs into formal code (Lean) is critical. But we face an extremely difficult question: How do we evaluate if the translation is correct?

  • Rigorous equivalence checking is computationally infeasible.
  • Human evaluation is exorbitantly expensive.
  • Surface-level proxy metrics like BLEU are poor indicators of semantic correctness.
  • Natural language proofs are riddled with “Reasoning gaps” (e.g., “it is trivial to show that…”), whereas formal proofs must be strictly “Gap-free.”

To address this, researchers have proposed frameworks like FormalAlign (Automated Alignment Evaluation) and FormalRx. FormalRx establishes a Semantic, Constraint, Implementation (SCI) error taxonomy to automatically identify semantic failures, allowing it to Rectify and eXamine the results of autoformalization.

5.3 Flowing Back to Informal Math: Agentic AI

While formal mathematics is rigorous, it has its limitations (it cannot easily encompass cutting-edge math areas not yet supported by Mathlib). Therefore, purely natural-language math research agents like Aletheia (powered by Gemini Deep Think) remain crucial. Aletheia features a natural language verifier to identify flaws in candidate solutions, enabling an iterative “generate-verify-revise” cycle. Crucially, this agent can admit failure, a feature that drastically improves collaboration efficiency for human researchers.


6. Conclusion

From solving “1+1=2” to challenging Fermat’s Last Theorem, LLMs are redefining the boundaries of mathematical research.

Whether it’s gaining absolutely correct feedback via the Lean compiler (AlphaProof), pruning infinite search trees using symbolic systems (LIPS), or building natural-language-based iteratively revising agents (Aletheia), Math Agents are demonstrating unprecedented Domain Intelligence. As human-AI collaboration paradigms like “Vibe-Proving” take shape, these underlying technologies will not only revolutionize theoretical mathematics but also radiate outward to accelerate foundational research in physics, computer science, and beyond.