VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean
Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig
cs.SE
Feb 20, 2026 · v1
TL;DR
Introduces VeriSoftBench, a benchmark of 500 Lean 4 proof obligations from open-source formal-methods software repositories.
Abstract
Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib-style mathematics transfer poorly to this repository-centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi-hop dependency closures are less likely to be solved. Third, providing curated context restricted to a proof's dependency closure improves performance relative to exposing the full repository, but nevertheless leaves substantial room for improvement. Our benchmark and evaluation suite are released at
https://github.com/utopia-group/VeriSoftBench.
Problem
Most benchmarks for LLM-based theorem proving in Lean are drawn from Mathlib-style mathematics and do not reflect the definition-rich, multi-module context of software verification projects.
Approach
VeriSoftBench is a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods repositories, packaged to preserve realistic cross-file dependencies. The authors evaluate frontier LLMs and specialized provers under two context conditions: curated context restricted to a proof's dependency closure, and full repository context.
Results
Provers tuned for Mathlib transfer poorly to this setting. Gemini-3-Pro achieves the best result at 41.0% (Pass@8, curated context) versus 34.8% (full repo). Goedel-Prover-v2, a specialized prover, scores only 5.6% with curated context and 0% with full repo context. Success correlates strongly with transitive dependency closure size.
| Model | Curated Context | Full Repo Context |
|---|
| Claude Opus 4.5 (Pass@8) | 31.2% | 23.2% |
| GPT-5.2 (Pass@8) | 12.6% | 10.8% |
| Gemini-3-Pro (Pass@8) | 41.0% | 34.8% |
| Goedel-Prover-v2 (Pass@8) | 5.6% | 0.0% |
Performance on VeriSoftBench-Full