Leanabell-Prover: Posttraining Scaling in Formal Reasoning

Jingyuan Zhang, Qi Wang, Xingguang Ji, Yahui Liu, Yang Yue, Fuzheng Zhang, Di Zhang, Guorui Zhou, Kun Gai

cs.AI Apr 8, 2025 · v1

Read PDF arXiv abstract Code

TL;DR

Posttrains Lean 4 LLM theorem provers with continual training and reinforcement learning using the Lean 4 compiler as reward.

Abstract

Recent advances in automated theorem proving (ATP) through LLMs have highlighted the potential of formal reasoning with Lean 4 codes. However, ATP has not yet be revolutionized by the recent posttraining scaling as demonstrated by Open AI O1/O3 and Deepseek R1. In this work, we investigate the entire posttraining of ATP, aiming to align it with breakthroughs in reasoning models in natural languages. To begin, we continual train current ATP models with a hybrid dataset, which consists of numerous statement-proof pairs, and additional data aimed at incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. Next, we explore reinforcement learning with the use of outcome reward returned by Lean 4 compiler. Through our designed continual training and reinforcement learning processes, we have successfully improved existing formal provers, including both DeepSeek-Prover-v1.5 and Goedel-Prover, achieving state-of-the-art performance in the field of whole-proof generation. For example, we achieve a 59.8% pass rate (pass@32) on MiniF2F. This is an on-going project and we will progressively update our findings, release our data and training details.

Problem

Automated theorem proving in Lean 4 has not yet been advanced by posttraining scaling techniques (reinforcement learning with compiler feedback) that have proven effective for natural language reasoning.

Approach

The authors continually train existing ATP models with a hybrid dataset combining statement-proof pairs and data incorporating cognitive behaviors that emulate human reasoning and hypothesis refinement. They then apply reinforcement learning using GRPO with outcome rewards from the Lean 4 compiler (success/fail binary signal). The approach is applied on top of both DeepSeek-Prover-v1.5 and Goedel-Prover.

Results

Leanabell-Prover achieves 59.8% pass@32 on MiniF2F-test, establishing state-of-the-art for whole-proof generation. RL training improves both baseline models, with the Goedel-Prover variant showing the largest gains.

Method	Sample Budget	MiniF2F-test
DeepSeek-Prover-v1.5-Base	6400	42.2%
DeepSeek-Prover-v1.5-RL	6400	50.0%
Goedel-Prover	32	57.6%
Leanabell-Prover-GD-RL	32	59.8%

Performance on MiniF2F-test

Papers With

Leanabell-Prover: Posttraining Scaling in Formal Reasoning