Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study
Zachary Burton
cs.AI
Jan 22, 2026 · v1
TL;DR
Diagnoses inference-time mode collapse of RL-trained Lean theorem provers on miniF2F.
Abstract
RL-trained Lean theorem provers mode-collapse at inference time: on miniF2F-test with DeepSeek-Prover-V1.5-RL, doubling the i.i.d.\ sampling budget from $k{=}32$ to $k{=}64$ produces zero additional solved theorems (42/244 in both cases). A fixed schedule of 15 tactic skeletons breaks this plateau and recovers a $+45%$ relative improvement at $k{=}16$ (mean $Δ= +12.3 \pm 4.2$ theorems across $n{=}3$ seeds, sign preserved in every seed). A controlled diversity ablation rules out the prompt-diversity confound: tactic skeletons help, paraphrases match the baseline, and irrelevant Lean comments actively degrade. A leave-one-out formalization-difficulty stratification reveals a structural-content gradient across the three perturbations. The phenomenon is RL-specific: V1.5-Base proves zero theorems regardless of intervention, identifying RL as the stage that creates the proof capability which subsequently collapses; extending to two additional 7B Lean provers, RL-trained DeepSeek-Prover-V2-7B contributes $+3$ frontier solves no i.i.d.\ baseline can reach despite a flat aggregate, while SFT-trained Goedel-Prover does not ($-10.0 \pm 4.4$ theorems, $n{=}3$, sign preserved every seed). Inference-time structural diversity is a cheap, complementary axis for RL-trained provers, orthogonal to scaling model size or training compute.
Problem
RL-trained Lean theorem provers suffer from mode-collapse at inference time: on miniF2F-test with DeepSeek-Prover-V1.5-RL, doubling the sampling budget from k=32 to k=64 produces zero additional solved theorems (42/244 in both cases).
Approach
The authors introduce a fixed schedule of 15 tactic skeletons (hand-selected Lean tactic patterns) crossed with 8 goal-hint comments, creating a 120-attempt grid. Each attempt uses a distinct structured query combining the theorem statement with a tactic skeleton. A controlled diversity ablation compares tactic skeletons against paraphrases and irrelevant comments to isolate the source of improvement. A leave-one-out difficulty stratification analyzes which theorem classes benefit.
Results
The skeleton schedule recovers a +45% relative improvement at k=16 (mean +12.3 +/- 4.2 theorems across n=3 seeds). Tactic skeletons help, paraphrases match the baseline, and irrelevant Lean comments actively degrade performance. The phenomenon is RL-specific: V1.5-Base proves zero theorems under any condition, confirming the gains require an RL-trained model whose mode-collapse can be broken by structural diversity.
| Condition | Pass@16 | Pass@32 | Pass@64 |
|---|
| A-RL (i.i.d. baseline) | 38/244 (15.6%) | 42/244 (17.2%) | 42/244 (17.2%) |
Pass rates under different conditions on miniF2F-test