MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving

Ruida Wang, Rui Pan, Yuxin Li, Jipeng Zhang, Yizhen Jia, Shizhe Diao, Renjie Pi, Junjie Hu, Tong Zhang

cs.CL Mar 5, 2025 · v1

theorem-proving-ml autoformalization ai-agents

Read PDF arXiv abstract Code

TL;DR

MA-LoT is a model-collaboration framework for Lean 4 theorem proving combining long chain-of-thought with the Lean4 verifier.

Abstract

Solving mathematical problems using computer-verifiable languages like Lean has significantly impacted the mathematical and computer science communities. State-of-the-art methods utilize a single Large Language Model (LLM) to generate complete proof or perform tree search, but they fail to balance these tasks. We propose **MA-LoT**: *Model-CollAboration Lean-based Long Chain-of-Thought*, a comprehensive framework for Lean4 theorem proving to solve this issue. It separates the cognition tasks of general NL for whole-proof generation and error analysis for proof correction using the model-collaboration method. We achieve this by structured interaction of the LLM and Lean4 verifier in Long CoT. To implement the framework, we propose the novel *LoT-Transfer Learning* training-inference pipeline, which enables the Long CoT thinking capability to LLMs without special data annotation. Extensive experiment shows that our framework achieves a **61.07%** accuracy rate on the Lean4 version of the MiniF2F-Test dataset, largely outperforming DeepSeek-V3 (33.61%), single-model tree search (InternLM-Step-Prover, 50.70%), and whole-proof generation (Godel-Prover, 55.33%) baselines. Furthermore, our findings highlight the potential of combining Long CoT with formal verification for a more insightful generation in a broader perspective.

Problem

State-of-the-art Lean 4 theorem proving methods use a single LLM for either whole-proof generation or tree search, but fail to balance these complementary tasks effectively.

Approach

MA-LoT (Model-Collaboration Lean-based Long Chain-of-Thought) separates cognition tasks: one model handles whole-proof generation in natural language while another performs error analysis and proof correction. The framework uses structured interaction between the LLM and Lean 4 verifier within a Long CoT process. A LoT-Transfer Learning pipeline enables Long CoT thinking without special data annotation.

Results

MA-LoT achieves 61.07% on MiniF2F-Test (Lean 4), outperforming DeepSeek-V3 (33.61%), InternLM-Step-Prover tree search (50.70%), and Godel-Prover whole-proof generation (55.33%). Three rounds of correction progressively improve from 54.92% to 61.07%.

Method	Round 1	Round 2	Round 3
DS-Prover-v1.5	51.64%	53.28%	54.51%
Godel-Prover-SFT	54.92%	59.43%	61.07%

Performance on MiniF2F-Test

Papers With

MA-LoT: Model-Collaboration Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving