← All papers
First page of Characterizing Paraphrase-Induced Failures in Lean 4 Autoformalization

Characterizing Paraphrase-Induced Failures in Lean 4 Autoformalization

William Feng, Ethan Lou, Aryan Sharma

cs.LG Apr 25, 2026 · v1
Studies paraphrase-induced failure modes of Lean 4 autoformalizers.
Lean 4 autoformalization has become increasingly popular in recent years, with frontier language models and open-weight autoformalizers now producing valid formalizations of mathematical theorems. However, these evaluations often rely on single canonical phrasings of theorems and rarely probe whether outputs are robust to natural variation in inputs, while prior work has shown that semantically equivalent paraphrases often induce divergent formal outputs. We study the structure of these divergences in Lean 4 by applying deterministic paraphrase rules to datasets of undergraduate and Olympiad-level math problems. Across four frontier models and three open-weight autoformalizers, we find that paraphrase sensitivity is dominated by failures at the code-generation layer, and that these failures are typed differently by dataset. Furthermore, these patterns generalize to open-weight models, showing that state-of-the-art autoformalizers still struggle to generate valid Lean code. Our results provide a failure-mode taxonomy for autoformalization and motivate training-time interventions targeted at specific compilation failures.

Lean 4 autoformalization benchmarks evaluate models on single canonical phrasings of theorems and rarely probe whether outputs are robust to natural variation in inputs. Semantically equivalent paraphrases often induce divergent formal outputs, but the structure of these divergences is poorly understood.

The authors apply 60 deterministic paraphrase rules across 11 categories to datasets of undergraduate (ProofNet#) and Olympiad-level (miniF2F) math problems. They test four frontier models (GPT-5.4, o3, o1, o4-mini) and three open-weight autoformalizers on the original and paraphrased inputs, then classify all compilation failures by type (unknown identifier, syntax error, elaboration error, other) to build a failure-mode taxonomy for Lean 4 autoformalization.

Figure 1: (a) Per-model surface consistency under paraphrased inputs on ProofNet# (teal) and miniF2F (orange). Pair agreement ranges from 19 – 56\% across the GPT panel. Error bars are 95\% bootstrap CIs. (b) Per-category BEq+ surface consistency, pooled across the GPT panel, with ProofNet# categories (teal) and miniF2F categories (orange) shown together.

Pair agreement (surface consistency) ranges from 19-56% across the GPT panel. Paraphrase sensitivity is dominated by failures at the code-generation layer rather than semantic misunderstanding. Failure types differ by dataset: ProofNet# shows more unknown-identifier errors while miniF2F shows more syntax and elaboration errors. These patterns generalize to open-weight models, indicating that state-of-the-art autoformalizers still struggle to generate valid Lean code under input variation.

Figure 3: Per-(rule, model) BEq+ equivalence on ProofNet# (GPT panel). Rows sorted by panel-mean ascending. Dots denote cells with no paired predictions.
DatasetModelNBasePertBoth
PN#GPT-5.429911.7%11.4%9.7%
PN#o323718.1%19.0%14.3%
PN#o124913.3%12.0%8.4%
PN#o4-mini32412.3%12.0%6.8%
MF2FGPT-5.482715.8%16.1%14.6%
MF2Fo358223.7%23.9%20.6%
MF2Fo172324.2%24.1%19.9%
MF2Fo4-mini65821.7%21.6%16.9%
Compilation success rates across models and datasets