← All papers
First page of Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving

Local Look-Ahead Guidance via Verifier-in-the-Loop for Automated Theorem Proving

Sara Rajaee, Kumar Pratik, Gabriele Cesa, Arash Behboodi

cs.AI Mar 12, 2025 · v1
Uses Lean as an automated verifier giving step-wise feedback to train an LLM for automated theorem proving.
The most promising recent methods for AI reasoning require applying variants of reinforcement learning (RL) either on rolled out trajectories from the LLMs, even for the step-wise rewards, or large quantities of human-annotated trajectory data. The reliance on the rolled-out trajectory renders the compute cost and time prohibitively high. In particular, the correctness of a reasoning trajectory can typically only be judged at its completion, leading to sparse rewards in RL or requiring expensive synthetic data generation in expert iteration-like methods. In this work, we focus on the Automatic Theorem Proving (ATP) task and propose a novel verifier-in-the-loop design, which, unlike existing approaches that leverage feedback on the entire reasoning trajectory, employs an automated verifier to give intermediate feedback at each step of the reasoning process. Using Lean as the verifier, we empirically show that the step-by-step local verification produces a global improvement in the model's reasoning accuracy and efficiency.

Reinforcement learning for theorem proving requires rolled-out trajectories to evaluate correctness, making compute costs prohibitively high. Step-wise rewards are sparse since correctness can typically only be judged at trajectory completion.

The authors propose LeanListener, a verifier-in-the-loop framework for automated theorem proving in Lean. The system uses online GRPO (Group Relative Policy Optimization) training where generated tactics are immediately verified by Lean, providing step-level feedback. The framework applies DPO and GRPO objectives with various pairing strategies, training on the LeanDojo benchmark. The key insight is using the number of remaining sub-goals as a reward signal for each generated tactic.

LeanListener with GRPO achieves 53.21% on the LeanDojo random split and 41.11% on novel premises, outperforming ReProver (51.2%/26.3%) and ReProver* (52.76%/40.86%). Online DPO with random pairing achieves 50.25% on the random split. The method significantly increases tactic validity rates compared to offline training strategies.