A benchmark for vericoding: formally verified program synthesis

Sergiu Bursuc, Theodore Ehrenborg, Shaowei Lin, Lacramioara Astefanoaei, Ionel Emilian Chiosa, Jure Kukovec, Alok Singh, Oliver Butterley, Adem Bizid, Quinn Dougherty, Miranda Zhao, Max Tan, Max Tegmark

cs.SE Sep 26, 2025 · v1

benchmark dataset verification theorem-proving-ml

Read PDF arXiv abstract Code

TL;DR

Includes 7,141 Lean formal specifications in a vericoding benchmark and reports LLM verified-synthesis success rates on them.

Abstract

We present and test the largest benchmark for vericoding, LLM-generation of formally verified code from formal specifications - in contrast to vibe coding, which generates potentially buggy code from a natural language description. Our benchmark contains 12,504 formal specifications, with 3,029 in Dafny, 2,334 in Verus/Rust and 7,141 in Lean. Of these, 6,174 are new unseen problems. We find vericoding success rates of 27% in Lean, 44% in Verus/Rust and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. We also find that LLM progress has improved progress on pure Dafny verification from 68% to 96% over the past year. The benchmark and vericoding results are shared at https://github.com/Beneficial-AI-Foundation/vericoding-benchmark

Problem

Vibe coding (LLM-generation of code from natural language) produces potentially buggy code, and existing benchmarks for formally verified program synthesis (vericoding) are limited in scale and language coverage.

Approach

The authors present the largest benchmark for vericoding with 12,504 formal specifications across three languages: 3,029 in Dafny, 2,334 in Verus/Rust, and 7,141 in Lean. Of these, 6,174 are new unseen problems. They evaluate off-the-shelf LLMs on generating formally verified code from these specifications, and also test whether adding natural-language descriptions improves performance.

Results

Vericoding success rates are 27% in Lean, 44% in Verus/Rust, and 82% in Dafny using off-the-shelf LLMs. Adding natural-language descriptions does not significantly improve performance. LLM progress has improved pure Dafny verification from 68% to 96% over the past year.

Papers With

A benchmark for vericoding: formally verified program synthesis