← All papers
First page of StepFun-Prover Preview: Let's Think and Verify Step by Step

StepFun-Prover Preview: Let's Think and Verify Step by Step

Shijie Shang, Ruosi Wan, Yue Peng, Yutong Wu, Xiong-hui Chen, Jie Yan, Xiangyu Zhang

cs.AI Jul 27, 2025 · v1
StepFun-Prover is an LLM trained with tool-integrated RL to generate Lean 4 proofs, evaluated on miniF2F-test.
We present StepFun-Prover Preview, a large language model designed for formal theorem proving through tool-integrated reasoning. Using a reinforcement learning pipeline that incorporates tool-based interactions, StepFun-Prover can achieve strong performance in generating Lean 4 proofs with minimal sampling. Our approach enables the model to emulate human-like problem-solving strategies by iteratively refining proofs based on real-time environment feedback. On the miniF2F-test benchmark, StepFun-Prover achieves a pass@1 success rate of $70.0\%$. Beyond advancing benchmark performance, we introduce an end-to-end training framework for developing tool-integrated reasoning models, offering a promising direction for automated theorem proving and Math AI assistant.

Automated theorem provers for Lean 4 require extensive sampling to find proofs, limiting practical utility. Achieving high pass@1 rates with minimal interaction remains challenging.

The authors present StepFun-Prover Preview, an LLM for formal theorem proving through tool-integrated reasoning. A reinforcement learning pipeline incorporating tool-based interactions enables the model to emulate human-like problem-solving by iteratively refining proofs based on real-time Lean 4 REPL feedback. The training framework consists of a cold start phase compiling open-source formal math datasets, followed by tool-integrated GRPO where the model learns to interact with the Lean environment, plus iterative RL-SFT cycles.

On miniF2F-test, StepFun-Prover-Preview-32B achieves 70.0% pass@1, outperforming DeepSeek-Prover-V2-671B (61.9%) and Kimina-Prover-72B (63.9%). The 7B variant achieves 66.0%. Performance scales with generation length, reaching 70.0% at 20,480 tokens versus 58.3% at 4,096.

ModelPass@1
DeepSeek-Prover-V2-7B58.6%
DeepSeek-Prover-V2-671B61.9%
Kimina-Prover-8B61.1%
Kimina-Prover-72B63.9%
StepFun-Prover-Preview-7B66.0%
StepFun-Prover-Preview-32B70.0%
miniF2F-test pass@1 comparison