Lean-ing on Quality: How High-Quality Data Beats Diverse Multilingual Data in AutoFormalization
Willy Chan, Michael Souliman, Jakob Nordhagen, Brando Miranda, Elyas Obbad, Kai Fronsdal Sanmi Koyejo
cs.AI
Feb 18, 2025 · v1
TL;DR
Improves LLM autoformalization targeting Lean using back-translation and proof-state-aware data, evaluated on ProofNet with LeanDojo and Mathlib4 data.
Abstract
Autoformalization, the process of transforming informal mathematical language into formal specifications and proofs remains a difficult task for state-of-the-art (large) language models. Existing works point to competing explanations for the performance gap. To this end, we introduce a novel methodology that leverages back-translation with hand-curated prompts to enhance the mathematical capabilities of language models, particularly addressing the challenge posed by the scarcity of labeled data. Specifically, we evaluate three primary variations of this strategy: (1) on-the-fly (online) backtranslation, (2) distilled (offline) backtranslation with few-shot amplification, and (3) line-by-line proof analysis integrated with proof state information. Each variant is designed to optimize data quality over quantity, focusing on the high fidelity of generated proofs rather than sheer data scale. Our findings provide evidence that employing our proposed approaches to generate synthetic data, which prioritizes quality over volume, improves the Autoformalization performance of LLMs as measured by standard benchmarks such as ProofNet. Crucially, our approach outperforms pretrained models using a minimal number of tokens. We also show, through strategic prompting and backtranslation, that our approaches surpass the performance of fine-tuning with extensive multilingual datasets such as MMA on ProofNet with only 1/150th of the tokens. Taken together, our methods show a promising new approach to significantly reduce the resources required to formalize proofs, thereby accelerating AI for math.
Problem
Autoformalization remains difficult for large language models, and existing approaches disagree on whether the bottleneck is data volume or data quality. Multilingual datasets like MMA are large but may not be the most efficient path.
Approach
The authors introduce back-translation with hand-curated prompts in three variations: (1) on-the-fly (online) backtranslation, (2) distilled (offline) backtranslation with few-shot amplification, and (3) line-by-line proof analysis integrated with proof state information. Each variant prioritizes quality over quantity of synthetic Lean training data.
Results
The approach outperforms pretrained models on ProofNet using minimal tokens. It surpasses fine-tuning with the extensive multilingual MMA dataset using only 1/150th of the tokens, providing evidence that high-quality data focused on fidelity beats sheer volume for autoformalization.