FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels

Jiedong Jiang, Wanyi He, Yuefeng Wang, Guoxiong Gao, Yongle Hu, Jingting Wang, Nailin Guan, Peihao Wu, Chunbo Dai, Liang Xiao, Bin Dong

cs.LG Nov 4, 2025 · v1

benchmark theorem-proving-ml algebra mathlib

Read PDF arXiv abstract Code

TL;DR

Introduces FATE-H and FATE-X, Lean/Mathlib formal-algebra benchmarks of 100 problems each, evaluating LLM provers on advanced algebra.

Abstract

Recent advances in large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, particularly on contest-based mathematical benchmarks like the IMO. However, these contests do not reflect the depth, breadth, and abstraction of modern mathematical research. To bridge this gap, we introduce FATE (Formal Algebra Theorem Evaluation), a new benchmark series in formal algebra designed to chart a course toward advanced mathematical reasoning. We present two new components, FATE-H and FATE-X, each with 100 problems in abstract and commutative algebra. The FATE series spans a difficulty spectrum from undergraduate exercises to problems exceeding PhD qualifying exams. Notably, FATE-X is the first formal benchmark to surpass both PhD-level exam difficulty and the coverage of the Mathlib library. Our evaluations of state-of-the-art LLM provers on this new benchmark reveal a stark performance gap compared to contest math: the best model achieves only 3% (pass@64) accuracy on FATE-H and 0% on FATE-X. Our two-stage evaluation reveals that models' natural-language reasoning is notably more accurate than their ability to formalize this reasoning. We systematically classify the common errors that arise during this formalization process. Furthermore, a comparative study shows that a specialized prover can exhibit less effective reflection than general-purpose models, reducing its accuracy at the natural-language stage. We believe FATE provides a robust and challenging benchmark that establishes essential checkpoints on the path toward research-level formal mathematical reasoning.

Problem

Existing formal theorem proving benchmarks focus on contest-level mathematics and do not reflect the depth and abstraction of modern mathematical research, particularly in algebra.

Approach

FATE (Formal Algebra Theorem Evaluation) introduces benchmarks FATE-H and FATE-X with 100 problems each in abstract and commutative algebra, spanning difficulty from undergraduate exercises to problems exceeding PhD qualifying exams. FATE-X is the first formal benchmark to surpass both PhD-level difficulty and Mathlib library coverage. A two-stage evaluation separates natural-language reasoning accuracy from formalization ability.

Results

The best model (DeepSeek-Prover-V2-671B) achieves only 3% on FATE-H (pass@64) and 0% on FATE-X. Natural-language reasoning is notably more accurate than formalization: DeepSeek-R1 scores 71% at NL stage on FATE-H but 0% in formal proofs. Dominant error types are Mathlib hallucination and Lean proficiency issues.

Model	FATE-M	FATE-H	FATE-X
DeepSeek-Prover-V2-671B	62.7%	3.0%	0.0%
Goedel-Prover-V2-32B	48.7%	2.0%	0.0%
o3	51.3%	3.0%	0.0%
Claude-Sonnet-4	45.3%	0.0%	0.0%

Formal proof accuracy on FATE benchmarks

Papers With

FATE: A Formal Benchmark Series for Frontier Algebra of Multiple Difficulty Levels