REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning
Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, Haocheng Ju, Peihao Wu, Bryan Dai, Bin Dong
cs.CL
May 27, 2025 · v1
TL;DR
A fine-tuned stepwise LLM theorem prover for Lean 4 with a Mathlib retrieval system for college-level mathematics.
Abstract
Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tune achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset-comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).
Problem
Formal theorem provers have made progress on high-school and competition-level mathematics, but few generalize to college-level and more advanced problems. Existing systems lack effective retrieval integration for leveraging large mathematical libraries during proof search.
Approach
The authors present REAL-Prover, a stepwise theorem prover for Lean 4 integrating a fine-tuned LLM (based on Qwen2.5-Math-7B) with a retrieval system (LeanSearch-PS). Training data of 210,420 state-tactic pairs is constructed via the HERALD-AF pipeline, which converts natural language mathematics into formal Lean statements, combined with expert iteration. The prover uses retrieval-augmented prompts containing natural language descriptions, proof context, current state, and retrieved premises.
Results
REAL-Prover achieves 23.7% on ProofNet with a 64x64 sampling budget using only supervised fine-tuning, surpassing DeepSeek-Prover-V1.5-RL+RMaxTS under a 1x3200 budget. It outperforms Goedel-Prover (15.6%) and DeepSeek-Prover-V1.5-SFT (15.9%) on ProofNet.
| Prove System | Sampling Budget | ProofNet Test |
|---|
| Goedel Prover | 32 | 15.6% |
| DeepSeek-Prover-V1.5-SFT | 128 | 15.9% |
| DeepSeek-Prover-V1.5-RL | 128 | 18.2% |
| DeepSeek-Prover-V1.5-RL + RMaxTS | 1x3200 | 21.6% |
| REAL-Prover | 64x64 | 23.7% |
ProofNet results