← All papers
RLMEval: Evaluating Research-Level Neural Theorem Proving
cs.CL
Oct 29, 2025 · v1
TL;DR
Builds an evaluation suite of 613 theorems drawn from six real Lean Blueprint formalization projects.
Abstract
Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3 % pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.
Problem
Despite impressive results on curated benchmarks, the practical impact of LLMs on research-level neural theorem proving and proof autoformalization remains limited, and existing benchmarks do not reflect real-world difficulty.
Approach
RLMEval is an evaluation suite for neural theorem proving and proof autoformalization that targets research-level mathematics from real Lean formalization projects. It leverages real Lean Blueprint formalization projects to construct challenging evaluation sets, comprising 613 theorems from 6 Lean projects.
Results
The best model achieves only a 10.3% pass rate on RLMEval, revealing a significant gap between performance on existing benchmarks and more realistic research-level settings.
