FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory

Xiao-Wen Yang, Zihao Zhang, Jianuo Cao, Zhi Zhou, Zenan Li, Lan-Zhe Guo, Yuan Yao, Taolue Chen, Yu-Feng Li, Xiaoxing Ma

cs.CL Sep 26, 2025 · v1

benchmark dataset theorem-proving-ml formalization

Read PDF arXiv abstract Code

TL;DR

Introduces FormalML, a Lean 4 benchmark of 4937 machine-learning-theory subgoal-completion problems and evaluates LLM provers on it.

Abstract

Large language models (LLMs) have recently demonstrated remarkable progress in formal theorem proving. Yet their ability to serve as practical assistants for mathematicians, filling in missing steps within complex proofs, remains underexplored. We identify this challenge as the task of subgoal completion, where an LLM must discharge short but nontrivial proof obligations left unresolved in a human-provided sketch. To study this problem, we introduce FormalML, a Lean 4 benchmark built from foundational theories of machine learning. Using a translation tactic that converts procedural proofs into declarative form, we extract 4937 problems spanning optimization and probability inequalities, with varying levels of difficulty. FormalML is the first subgoal completion benchmark to combine premise retrieval and complex research-level contexts. Evaluation of state-of-the-art provers highlights persistent limitations in accuracy and efficiency, underscoring the need for more capable LLM-based theorem provers for effective subgoal completion,

Problem

LLMs can prove theorems end-to-end, but their ability to fill in missing steps within complex human-written proofs (subgoal completion) remains underexplored, and no benchmark exists for this task in research-level mathematical contexts.

Approach

FormalML is a Lean 4 benchmark built from machine learning theory (optimization and probability). A translation tactic converts procedural proofs into declarative form, extracting 4937 subgoal-completion problems at varying difficulty levels. The benchmark combines premise retrieval requirements with complex research-level contexts, distinguishing it from prior theorem-proving benchmarks.

Results

State-of-the-art provers show persistent limitations. The best tree-search method (BFS-Prover, 7B) achieves 25.31% overall at Pass@50. The best whole-proof generation method (Goedel-Prover-V2, 32B) reaches 52.54% at Pass@32. Performance on probability problems consistently lags optimization, underscoring difficulty with measure-theoretic reasoning.

Method	Model Size	Pass@K (%) All
BFS-Prover	7B	25.31 (Pass@50)
Kimina-Prover-Preview	7B	45.74 (Pass@32)
Goedel-Prover-V2	32B	52.54 (Pass@32)
Reprover	229M	24.79 (Pass@50)

Performance on FormalML

Papers With

FormalML: A Benchmark for Evaluating Formal Subgoal Completion in Machine Learning Theory