Leanabell-Prover: Posttraining Scaling in Formal Reasoning
Jingyuan Zhang, Qi Wang, Xingguang Ji, Yahui Liu, Yang Yue, Fuzheng Zhang, Di Zhang, Guorui Zhou, Kun Gai
cs.AI
Apr 8, 2025 · v1
TL;DR
Posttrains Lean 4 LLM theorem provers with continual training and reinforcement learning using the Lean 4 compiler as reward.
Abstract
Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 codes. However, ATP has not yet be revolutionized by the recent posttraining scaling as demonstrated by Open AI O1/O3 and Deepseek R1. In this work, we investigate the entire posttraining of ATP, aiming to align it with breakthroughs in reasoning models in natural languages. To begin, we continual train current ATP models with a hybrid dataset, which consists of numerous statement-proof pairs, and additional data aimed at incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. Next, we explore reinforcement learning with the use of outcome reward returned by Lean 4 compiler. Through our designed continual training and reinforcement learning processes, we have successfully improved existing formal provers, including both DeepSeek-Prover-v1.5 and Goedel-Prover, achieving state-of-the-art performance in the field of whole-proof generation. For example, we achieve a 59.8% pass rate (pass@32) on MiniF2F. This is an on-going project and we will progressively update our findings, release our data and training details.
Problem
Automated theorem proving in Lean 4 has not yet been advanced by posttraining scaling techniques (reinforcement learning with compiler feedback) that have proven effective for natural language reasoning.
Approach
The authors continually train existing ATP models with a hybrid dataset combining statement-proof pairs and data incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. They then apply reinforcement learning using GRPO with outcome rewards from the Lean 4 compiler (success/fail binary signal). The approach is applied on top of both DeepSeek-Prover-v1.5 and Goedel-Prover.
Results
Leanabell-Prover achieves 59.8% pass@32 on MiniF2F-test, establishing state-of-the-art for whole-proof generation. RL training improves both baseline models, with the Goedel-Prover variant showing the largest gains.
| Method | Sample Budget | MiniF2F-test |
|---|
| DeepSeek-Prover-v1.5-Base | 6400 | 42.2% |
| DeepSeek-Prover-v1.5-RL | 6400 | 50.0% |
| Goedel-Prover | 32 | 57.6% |
| Leanabell-Prover-GD-RL | 32 | 59.8% |
Performance on MiniF2F-test