Generating Millions Of Lean Theorems With Proofs By Exploring State Transition Graphs
David Yin, Jing Gao
cs.LO
Feb 16, 2025 · v1
TL;DR
Generates 4.7M Lean theorems with proofs from Mathlib4 by exploring proof state-transition graphs to train a prover.
Abstract
Large Language Models (LLMs) have demonstrated significant potential in generating mathematical proofs. However, a persistent challenge is that LLMs occasionally make mistakes, while even a minor mistake can invalidate an entire proof. Proof assistants like Lean offer a great remedy. They are designed for verifying each step of a proof in a formal language, and in recent years researchers have created AI models to generate proofs in their languages. However, the scarcity of large-scale datasets of Lean proofs restrict the performance of such Automated Theorem Proving (ATP) models. We developed LeanNavigator, a novel method for generating a large-scale dataset of Lean theorems and proofs by finding new ways to prove existing Lean theorems. By leveraging an interactive Lean client and an efficient method for proof step generation, LeanNavigator efficiently produces new theorems with corresponding proofs. Applying this approach to Mathlib4, we generated 4.7 million theorems totaling 1 billion tokens, surpassing previous datasets by more than an order of magnitude. Using this extensive dataset, we trained an AI model that outperforms the state-of-the-art ReProver model in theorem-proving tasks. These results confirm our hypothesis and demonstrate the critical role of large datasets in improving the performance of automated theorem provers.
Problem
AI models for automated theorem proving in Lean are limited by the scarcity of large-scale training datasets. Existing datasets like ReProver contain only around 112K theorems.
Approach
LeanNavigator generates millions of Lean theorems with proofs by exploring state transition graphs. Starting from known theorems, it applies tactics to create new proof states, then traverses the resulting graph to extract new theorems. Each theorem accumulates an average of 2035 intermediate states (compared to ReProver's 21.69), producing a dataset of 4.7 million theorems with 1 billion tokens.
Results
LeanNavigator generates 4.7M theorems (vs. ReProver's 112K) with 1B tokens (vs. 57M). A flan-t5-base model trained on this data achieves 39/117 on MIL and 104/493 on MiniF2F, outperforming ReProver's 30/117 and 99/493 respectively.
| Model | MIL | MiniF2F |
|---|
| LeanNavigator flan-t5 base | 39/117 | 104/493 |
| LeanNavigator flan-t5 small | 25/117 | 52/493 |
| ReProver | 30/117 | 99/493 |
Theorem proving benchmark results