MerLean-Prover: A Recursive Looping Harness for Lean 4 Theorem Proving
Jinzheng Li, Zeru Zhu, Yuanjie Ren
cs.LO
May 26, 2026 · v1
TL;DR
Presents MerLean-Prover, an end-to-end agentic harness replacing sorry with kernel-checked Lean 4 proofs, evaluated on Putnam2025.
Abstract
MerLean-Prover is an end-to-end Lean4 theorem prover that replaces sorry declarations with kernel-checkable proofs. It is built from three agent types (Planning, Check, and Lean) composed by a recursive outer loop whose unit of revision is the proof plan itself, and uses no fine-tuning, no custom RL objective, and no theorem-specific scaffolding. On FormalQualBench, a benchmark of 23 PhD-qualifying-exam theorems, MerLean-Prover solves 10/23, surpassing the strongest published open-source baseline (OpenGauss, 8/23). On Putnam2025, the same harness closes 12/12 with substantially lower total wall-clock than the next-best system that closes the full set. The harness also transfers to smaller models: Sonnet closes all four tested FormalQualBench problems, and Haiku closes the two short ones. These results suggest that harness design is a central factor in end-to-end Lean4 theorem proving, alongside raw model capability, and that a relatively simple harness can already be effective.
Problem
End-to-end Lean 4 theorem proving remains difficult: existing systems rely on fine-tuning, custom RL objectives, or theorem-specific scaffolding, and performance on challenging benchmarks like PhD-qualifying-exam theorems is limited.
Approach
MerLean-Prover uses three agent types (Planning, Check, and Lean) composed by a recursive outer loop whose unit of revision is the proof plan itself. It requires no fine-tuning, no custom RL objective, and no theorem-specific scaffolding. The harness replaces sorry declarations with kernel-checkable proofs.
Results
On FormalQualBench (23 PhD-qualifying-exam theorems), MerLean-Prover solves 10/23, surpassing the strongest open-source baseline (OpenGauss, 8/23). On Putnam2025, it closes 12/12 with lower total wall-clock time than the next-best system. The harness also transfers to smaller models: Sonnet closes all four tested FormalQualBench problems and Haiku closes two.