Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing

Hayden Moore, Asfahan Shah

cs.CL Nov 16, 2025 · v1

autoformalization theorem-proving-ml benchmark

Read PDF arXiv abstract

TL;DR

Tests LLM autoformalization robustness on MiniF2F and Lean 4 ProofNet using paraphrased statements, measuring Lean compilation validity.

Abstract

Large Language Models (LLMs) have recently emerged as powerful tools for autoformalization. Despite their impressive performance, these models can still struggle to produce grounded and verifiable formalizations. Recent work in text-to-SQL, has revealed that LLMs can be sensitive to paraphrased natural language (NL) inputs, even when high degrees of semantic fidelity are preserved. In this paper, we investigate this claim in the autoformalization domain. Specifically, we evaluate the robustness of LLMs generating formal proofs with semantically similar paraphrased NL statements by measuring semantic and compilation validity. Using the formal benchmarks MiniF2F and Lean 4 version of ProofNet, and two modern LLMs, we generate paraphrased natural language statements and cross-evaluate these statements across both models. The results of this paper reveal performance variability across paraphrased inputs, demonstrating that minor shifts in NL statements can significantly impact model outputs.

Problem

LLMs used for autoformalization may be sensitive to surface-level paraphrasing of natural language inputs, but this robustness has not been systematically evaluated in the formal proof domain.

Approach

The authors evaluate the robustness of LLMs generating Lean formal proofs when given semantically similar paraphrased natural language statements. Using MiniF2F and a Lean 4 version of ProofNet as benchmarks and two modern LLMs, they generate paraphrased statements and cross-evaluate outputs by measuring both semantic validity and compilation success.

Results

The results reveal performance variability across paraphrased inputs, demonstrating that minor shifts in natural language statements can significantly impact model outputs in autoformalization.

Papers With

Evaluating Autoformalization Robustness via Semantically Similar Paraphrasing