Evaluating the Robustness of Proof Autoformalization in Lean 4

Zhengtao Gui, Sheng Yang, Zhouxing Shi

cs.CL Jun 12, 2026 · v1 cs.AI cs.LG

autoformalization benchmark theorem-proving-ml

Read PDF arXiv abstract Code

TL;DR

Builds a benchmark measuring the robustness of LLM proof autoformalization into Lean 4 under global and local perturbations of informal proofs.

Abstract

Proof autoformalization aims to translate a mathematical informal proof written in natural language into a formal proof in a formal language such as Lean~4. Several works have developed LLM-based models for proof autoformalization. However, existing evaluations have typically focused on translating well-formed informal proofs from curated datasets. We argue that a robust proof autoformalizer must remain faithful even for informal proofs that diverge from these idealized ones, and we present the first study on the robustness of proof autoformalization models. We formulate two categories of perturbations and evaluate robustness under each: a global perturbation paraphrases the informal proof in a different style, under which the formalization should remain consistent; a local perturbation alters a value, symbol, or proof step, possibly in a counterfactual way, and a robust formalization should faithfully reflect the perturbation rather than reverting to the original one or inferring a different one on its own. We build a benchmark with both perturbations on miniF2F and MATH-500, and automatically measure how stable a proof autoformalization's correctness is under global perturbations and how faithfully its output reflects local perturbations. We evaluate seven recent models, all of which are sensitive to global perturbations and mostly fail to remain faithful under local perturbations. Code and data are available via https://github.com/ucr-rai/robust-proof-autoformalization.

Problem

Proof autoformalization models are evaluated only on well-formed informal proofs from curated datasets, leaving their robustness to inputs that diverge from idealized proofs unmeasured.

Approach

Defines two perturbation categories and builds a benchmark on miniF2F and MATH-500 to test each: a global perturbation paraphrases the informal proof in a different style, under which the Lean 4 formalization should stay consistent; a local perturbation alters a value, symbol, or step, possibly counterfactually, which a faithful formalization must reflect rather than revert. Correctness stability under global perturbations and faithfulness under local ones are measured automatically across seven recent models.

Results

All seven evaluated models are sensitive to global perturbations and mostly fail to remain faithful under local perturbations.

Papers With

Evaluating the Robustness of Proof Autoformalization in Lean 4