Uses Lean's elaboration as a process oracle, parsing proofs into tactics to supply dense tactic-level RL rewards on MiniF2F and ProofNet.
Abstract
While reinforcement learning from verifiable rewards (RLVR) typically has relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured rewards into a GRPO-style reinforcement learning objective with first-error propagation and first-token credit methods that balances outcome- and process-level advantages. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.
Problem
Reinforcement learning from verifiable rewards for theorem proving usually relies on a single binary signal, ignoring the fine-grained structured feedback a symbolic proof assistant can give.
Approach
Lean serves as a symbolic process oracle. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks locally sound steps and the earliest failing step, yielding dense type-theory-grounded credit signals. These outcome- and tactic-level rewards are combined in a GRPO-style objective with first-error propagation and first-token credit assignment.
Figure 1: Overall framework for combining outcome and tactic level rewards via Lean: the proof Y is parsed into tactics T_{1},\dots,T_{5} , with Lean providing outcome feedback g(Y) and step-level errors (e.g., failure at T_{3} invalidates later tactics). Rewards are then assigned to the first token of each tactic.
Results
On STP-Lean and DeepSeek-Prover-V1.5, tactic-level supervision outperforms outcome-only baselines in most settings; at sample budget 64 the STP model reaches 59.2% on MiniF2F-test and 19% on ProofNet-test.
Figure 2: Training dynamics showing (a) outcome reward, (b) tactic reward, (c) entropy, and (d) mean of response length during reinforcement learning. We used time-weighted smooting with smooting factor 0.99.
Model (budget 64)
MiniF2F-test
ProofNet-test
STP + Outcome only (GRPO)
57.9%
19%
STP + Outcome+Tactic RL (ours)
59.2%
19%
DeepSeek-Prover-V1.5 + Outcome only
57.4%
17.6%
MiniF2F-test and ProofNet-test pass rates at sample budget 64.