Distilling LLM Feedback for Lean Theorem Proving

Gaetan Narozniak, Gérard Biau, Rémi Munos, Ahmad Rammal, Pierre Marion

cs.AI May 29, 2026 · v1

Read PDF arXiv abstract

TL;DR

Proposes Feedback Distillation for post-training LLMs on Lean 4 theorem proving, evaluated on MiniF2F and LeanWorkbook using a Lean verifier.

Abstract

Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.

Problem

Post-training of reasoning models with GRPO suffers from sparse rewards, limited exploration, and mode collapse. In Lean 4 theorem proving, where the search space includes over 300,000 Mathlib declarations, these limitations are especially severe for models starting with low proficiency.

Approach

The authors propose Feedback Distillation, a self-distillation method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback from an LLM (e.g., Claude). The teacher is an EMA-updated copy of the student that receives textual feedback on the student's proof attempt. The loss is a per-token KL divergence between teacher and student distributions along student-sampled trajectories. The method is evaluated in a Lean 4 environment with tool-use (file writing, compilation, Mathlib regex search).

Figure 1: Example of Feedback Distillation on a Lean statement from the MiniF2F dataset (Zheng et al. , 2022 ) , with Qwen3-8B as the student. Claude provides feedback on the student’s attempt. The student’s answer is then colored token-by-token by \log p_{\text{teacher}}-\log p_{\text{student}} : green tokens are preferred by the teacher, red tokens by the student, and white tokens are assigned r

Results

Feedback Distillation maintains higher policy entropy and discovers Mathlib lemmas faster than GRPO. Initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. On MiniF2F, the combined pipeline achieves the best accuracy, and pass@k scaling is improved due to greater trajectory diversity.

Papers With

Distilling LLM Feedback for Lean Theorem Proving