Introduces TheoremBench, a Lean4 benchmark of ~100 classical theorems with extracted premises for evaluating LLM provers.
Abstract
LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce TheoremBench, a Lean4 benchmark designed to evaluate theorem provers beyond contest settings. The benchmark is built from nearly one hundred classical theorems and is released in two complementary forms: a plain main version containing one target theorem per instance, and a premised version that expands each theorem into a structured family of related proving tasks consisting of the main theorem together with automatically extracted supporting subtheorems. This design enables evaluation not only of whether the final theorem is proved from scratch, but also of partial progress through the internal proof structure of a theorem. Our experiments show that explicit premises substantially improve performance for Lean4-capable prover models. To provide a comprehensive evaluation, we introduce theorem-level coverage and token-efficiency metrics that expose qualitative differences in proof behavior. The results show that current provers remain strongly biased toward easy subtheorems and often solve theorems through long and inefficient tactic traces rather than compact proof plans. TheoremBench therefore provides a more fine-grained view of formal reasoning ability and highlights the importance of structural benchmark design for evaluating Lean4 theorem provers.
Problem
Existing formal-proving benchmarks concentrate on competition-style problems and do not capture how provers behave on longer, dependency-rich mathematical developments.
Approach
TheoremBench is built from Lean4 formalizations of classical theorems (inspired by Wiedijk's 100 theorems list) extracted into compilable snippets. It is released in a plain main version with one target theorem per instance and a premised version that expands each theorem into the main theorem plus automatically extracted supporting subtheorems, enabling measurement of partial progress through proof structure. The authors add theorem-level coverage and token-efficiency metrics.
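To make the plain/premised distinction concrete, the following Lean4 sketch shows what a premised instance might look like in miniature. It is purely illustrative: the names `premise_add_comm` and `main_swap` are hypothetical, the statements are toy examples rather than benchmark theorems, and the actual instance format of TheoremBench is not specified here.

```lean
-- Hypothetical supporting subtheorem (a "premise"); in the premised
-- setting it is a proving task of its own.
theorem premise_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Hypothetical main theorem. In the plain main setting only this statement
-- would be given; in the premised setting the premise above is also present.
theorem main_swap (a b c : Nat) : (a + b) + c = (b + a) + c := by
  -- Rewrite the left-hand side with the premise; `rw` then closes the goal by `rfl`.
  rw [premise_add_comm a b]
```

Under this reading, a prover that proves `premise_add_comm` but not `main_swap` has made partial progress that a single-target evaluation would not record.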
Figure 3: TheoremBench construction pipeline. Lean4 source files are parsed into theorem groups, enriched with required formal context, transformed into plain-main and premised instances, and verified by Lean4 before inclusion in the benchmark.
Results
Four Lean4-capable provers are evaluated under a 64-sample budget. Explicit premises substantially raise pass@64 for some of them (for example, DeepSeek-Prover-V2-7B rises from 0.053 in the plain setting to 0.460 in the premised setting). Provers remain biased toward easy subtheorems and produce long, inefficient tactic traces rather than compact proofs.
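For readers unfamiliar with pass@k, the sketch below uses the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn and c the number that verify. Whether the authors compute pass@64 in exactly this way is an assumption; the function names and the example counts are hypothetical and are not results from the benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one instance.

    n: samples drawn, c: samples that verified, k: budget being scored.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(successes: list[int], n: int, k: int) -> float:
    """Average pass@k over instances; successes[i] is c for instance i."""
    return sum(pass_at_k(n, c, k) for c in successes) / len(successes)


if __name__ == "__main__":
    # Hypothetical per-instance success counts out of 64 samples each.
    counts = [0, 3, 0, 64]
    print(benchmark_pass_at_k(counts, n=64, k=64))
```

When the budget equals the number of samples (n = k = 64), the estimator reduces to an indicator: an instance scores 1 if at least one of the 64 attempts verifies and 0 otherwise, so the benchmark score is the fraction of instances solved at least once.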
Figure 1: Performance comparison of Lean4-capable theorem provers in the plain main and premised settings, measured by their ability to fully prove the target theorem. Premises help substantially for DeepSeek and Goedel-Prover-V2-8B, modestly for Kimina, and not at all for the non-reasoning model Goedel-Prover-SFT.