REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning

Ziju Shen, Naohao Huang, Fanyi Yang, Yutong Wang, Guoxiong Gao, Tianyi Xu, Jiedong Jiang, Wanyi He, Pu Yang, Mengzhou Sun, Haocheng Ju, Peihao Wu, Bryan Dai, Bin Dong

cs.CL May 27, 2025 · v1

theorem-proving-ml tactic-automation benchmark algebra

Read PDF arXiv abstract Code

TL;DR

A fine-tuned stepwise LLM theorem prover for Lean 4 with a Mathlib retrieval system for college-level mathematics.

Abstract

Nowadays, formal theorem provers have made monumental progress on high-school and competition-level mathematics, but few of them generalize to more advanced mathematics. In this paper, we present REAL-Prover, a new open-source stepwise theorem prover for Lean 4 to push this boundary. This prover, based on our fine-tuned large language model (REAL-Prover-v1) and integrated with a retrieval system (Leansearch-PS), notably boosts performance on solving college-level mathematics problems. To train REAL-Prover-v1, we developed HERALD-AF, a data extraction pipeline that converts natural language math problems into formal statements, and a new open-source Lean 4 interactive environment (Jixia-interactive) to facilitate synthesis data collection. In our experiments, our prover using only supervised fine-tune achieves competitive results with a 23.7% success rate (Pass@64) on the ProofNet dataset-comparable to state-of-the-art (SOTA) models. To further evaluate our approach, we introduce FATE-M, a new benchmark focused on algebraic problems, where our prover achieves a SOTA success rate of 56.7% (Pass@64).

Problem

Formal theorem provers have made progress on high-school and competition-level mathematics, but few generalize to college-level and more advanced problems. Existing systems lack effective retrieval integration for leveraging large mathematical libraries during proof search.

Approach

The authors present REAL-Prover, a stepwise theorem prover for Lean 4 integrating a fine-tuned LLM (based on Qwen2.5-Math-7B) with a retrieval system (LeanSearch-PS). Training data of 210,420 state-tactic pairs is constructed via the HERALD-AF pipeline, which converts natural language mathematics into formal Lean statements, combined with expert iteration. The prover uses retrieval-augmented prompts containing natural language descriptions, proof context, current state, and retrieved premises.

Results

REAL-Prover achieves 23.7% on ProofNet with a 64x64 sampling budget using only supervised fine-tuning, surpassing DeepSeek-Prover-V1.5-RL+RMaxTS under a 1x3200 budget. It outperforms Goedel-Prover (15.6%) and DeepSeek-Prover-V1.5-SFT (15.9%) on ProofNet.

Prove System	Sampling Budget	ProofNet Test
Goedel Prover	32	15.6%
DeepSeek-Prover-V1.5-SFT	128	15.9%
DeepSeek-Prover-V1.5-RL	128	18.2%
DeepSeek-Prover-V1.5-RL + RMaxTS	1x3200	21.6%
REAL-Prover	64x64	23.7%

ProofNet results

Papers With

REAL-Prover: Retrieval Augmented Lean Prover for Mathematical Reasoning