FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

Zhouliang Yu, Ruotian Peng, Keyi Ding, Yizhe Li, Zhongyuan Peng, Minghao Liu, Yifan Zhang, Zheng Yuan, Huajian Xin, Wenhao Huang, Yandong Wen, Ge Zhang, Weiyang Liu

cs.AI May 5, 2025 · v1 cs.LG

benchmark autoformalization

Read PDF arXiv abstract

TL;DR

Presents FormalMATH, a Lean4 benchmark of 5,560 formally verified problems built with a human-in-the-loop autoformalization pipeline.

Abstract

Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in the formal reasoning settings. We believe that FormalMATH provides a robust benchmark for benchmarking formal mathematical reasoning.

Problem

Formal mathematical reasoning benchmarks are limited in scope and scale, and manual formalization is costly.

Approach

FormalMATH is a Lean4 benchmark of 5,560 formally verified problems from high-school olympiad to undergraduate level across algebra, calculus, number theory, and more. A human-in-the-loop pipeline combines specialized autoformalization LLMs, multi-LLM semantic verification, and negation-based disproof filtering with off-the-shelf provers.

Results

The pipeline retains 72.09% of statements before manual verification. The strongest LLM provers reach only 16.46% under practical sampling budgets, with pronounced domain bias and over-reliance on simple automation tactics; natural-language solution guidance counterintuitively hurt proof success.

Papers With

FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models