← All papers
FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models
cs.AI
May 5, 2025 · v1
cs.LG
TL;DR
Presents FormalMATH, a Lean4 benchmark of 5,560 formally verified problems built with a human-in-the-loop autoformalization pipeline.
Abstract
Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in the formal reasoning settings. We believe that FormalMATH provides a robust benchmark for benchmarking formal mathematical reasoning.
Problem
Formal mathematical reasoning benchmarks are limited in scope and scale, and manual formalization is costly.
Approach
FormalMATH is a Lean4 benchmark of 5,560 formally verified problems from high-school olympiad to undergraduate level across algebra, calculus, number theory, and more. A human-in-the-loop pipeline combines specialized autoformalization LLMs, multi-LLM semantic verification, and negation-based disproof filtering with off-the-shelf provers.
Results
The pipeline retains 72.09% of statements before manual verification. The strongest LLM provers reach only 16.46% under practical sampling budgets, with pronounced domain bias and over-reliance on simple automation tactics; natural-language solution guidance counterintuitively hurt proof success.
