VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean

Yutong Xin, Qiaochu Chen, Greg Durrett, Işil Dillig

cs.SE Feb 20, 2026 · v1

benchmark verification theorem-proving-ml

Read PDF arXiv abstract Code

TL;DR

Introduces VeriSoftBench, a benchmark of 500 Lean 4 proof obligations from open-source formal-methods software repositories.

Abstract

Large language models have achieved striking results in interactive theorem proving, particularly in Lean. However, most benchmarks for LLM-based proof automation are drawn from mathematics in the Mathlib ecosystem, whereas proofs in software verification are developed inside definition-rich codebases with substantial project-specific libraries. We introduce VeriSoftBench, a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods developments and packaged to preserve realistic repository context and cross-file dependencies. Our evaluation of frontier LLMs and specialized provers yields three observations. First, provers tuned for Mathlib-style mathematics transfer poorly to this repository-centric setting. Second, success is strongly correlated with transitive repository dependence: tasks whose proofs draw on large, multi-hop dependency closures are less likely to be solved. Third, providing curated context restricted to a proof's dependency closure improves performance relative to exposing the full repository, but nevertheless leaves substantial room for improvement. Our benchmark and evaluation suite are released at https://github.com/utopia-group/VeriSoftBench.

Problem

Most benchmarks for LLM-based theorem proving in Lean are drawn from Mathlib-style mathematics and do not reflect the definition-rich, multi-module context of software verification projects.

Approach

VeriSoftBench is a benchmark of 500 Lean 4 proof obligations drawn from open-source formal-methods repositories, packaged to preserve realistic cross-file dependencies. The authors evaluate frontier LLMs and specialized provers under two context conditions: curated context restricted to a proof's dependency closure, and full repository context.

Results

Provers tuned for Mathlib transfer poorly to this setting. Gemini-3-Pro achieves the best result at 41.0% (Pass@8, curated context) versus 34.8% (full repo). Goedel-Prover-v2, a specialized prover, scores only 5.6% with curated context and 0% with full repo context. Success correlates strongly with transitive dependency closure size.

Model	Curated Context	Full Repo Context
Claude Opus 4.5 (Pass@8)	31.2%	23.2%
GPT-5.2 (Pass@8)	12.6%	10.8%
Gemini-3-Pro (Pass@8)	41.0%	34.8%
Goedel-Prover-v2 (Pass@8)	5.6%	0.0%

Performance on VeriSoftBench-Full

Papers With

VeriSoftBench: Repository-Scale Formal Verification Benchmarks for Lean