← All papers

Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

Shihao Ji, Haotao Tan, Zihui Song, Mingyu Li

cs.AI May 31, 2026 · v1

theorem-proving-ml

Read PDF arXiv abstract

TL;DR

Introduces EVA reward modeling and the Leibniz reward model for LLM-based Lean 4 formal verification.

Abstract

Large Language Models (LLMs) are increasingly used with formal interactive theorem provers such as Lean 4. Scaling these systems with reinforcement learning or search methods requires process reward models (PRMs) that can evaluate intermediate reasoning steps. Existing reward-model designs expose a practical trade-off. Value-head models provide continuous scores but modify the generative model interface, while generative reward models preserve textual rationales but are poorly matched to continuous floating-point regression because numeric values are split across tokens. We introduce Expected Value Alignment (EVA), a reward-modeling procedure that keeps the surface output discrete while extracting continuous scores from the model's token distribution. The model emits integer scores in a structured JSON format, and EVA computes a continuous score as the expectation over the logits of the corresponding anchor tokens. Training combines the causal language modeling objective with an auxiliary mean squared error loss on these expected values. We instantiate EVA in \textit{Leibniz}, a reward model for Lean 4 formal verification, and evaluate it against zero-shot and reward-modeling baselines. The evaluation demonstrates that continuous logit-based scoring significantly reduces discretization artifacts while retaining the interpretability of generative critiques.

Problem

Process reward models for Lean 4 theorem proving face a trade-off: value-head models produce continuous scores but lose interpretability, while generative reward models preserve textual critiques but suffer from discretization artifacts when emitting floating-point scores as tokens.

Approach

The authors introduce Expected Value Alignment (EVA), a reward-modeling procedure that keeps surface output discrete (integer scores in structured JSON) while extracting continuous scores from the model's token distribution. EVA computes a continuous score as the expectation over logits of anchor tokens (integers 1-5), and training combines a causal language modeling objective with an auxiliary MSE loss on these expected values. The method is instantiated in Leibniz, a reward model for Lean 4 formal verification that evaluates proof steps along three dimensions: Logic, Alignment, and Clarity.

Results

Leibniz-1.5B achieves a Pearson correlation of 0.824 and ranking accuracy of 84.6% on a Lean 4 verification hold-out set, outperforming both value-head reward models (r=0.782, 81.2%) and standard SFT generative reward models (r=0.725, 78.5%), while retaining the generative critique interface.

Model	Logic MSE	Align MSE	Pearson r	Ranking Acc.
Zero-Shot GPT-4o	0.842	1.154	0.612	72.4%
Zero-Shot Qwen2.5-1.5B	1.250	1.680	0.450	58.1%
Standard SFT Generative RM	0.485	0.521	0.725	78.5%
Value-Head RM	0.312	0.345	0.782	81.2%
Leibniz-1.5B (Ours)	0.334	0.368	0.824	84.6%

Performance comparison on the Lean 4 mathematical verification hold-out set

Papers With

Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification