CLEVER: A Curated Benchmark for Formally Verified Code Generation
Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, Swarat Chaudhuri
cs.LG
May 20, 2025 · v1
TL;DR
Introduces CLEVER, a curated 161-problem Lean benchmark for end-to-end formally verified specification and code generation.
Abstract
We introduce ${\rm C{\small LEVER}}$, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, ${\rm C{\small LEVER}}$ avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean's type checker to ensure machine-checkable correctness. We use ${\rm C{\small LEVER}}$ to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub(
https://github.com/trishullab/clever) as well as HuggingFace(
https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online(
https://github.com/trishullab/clever-prover).
Problem
Existing benchmarks for verified code generation suffer from test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions, making them unreliable measures of formal verification capability.
Approach
CLEVER is a curated benchmark of 161 problems for end-to-end verified code generation in Lean, adapted from HumanEval. Each problem requires (1) generating a specification matching a held-out ground truth, and (2) generating a Lean implementation provably satisfying that specification. All outputs are verified post-hoc using Lean's type checker. The benchmark evaluates both specification certification (proving equivalence to ground truth) and implementation certification (proving correctness against the spec).
Results
State-of-the-art LLMs and agentic approaches struggle to achieve full verification, with end-to-end certification rates remaining low. The benchmark establishes a challenging frontier for program synthesis and formal reasoning where specification generation, equivalence proving, implementation, and correctness proving must all succeed together.