Proving the Coding Interview: A Benchmark for Formally Verified Code Generation
Existing code generation benchmarks check programs against unit tests, which cannot guarantee correctness on all inputs. Until now, no large-scale benchmark existed for generating programs with machine-checked correctness proofs.
FVAPPS (Formally Verified Automated Programming Progress Standards) generalizes Python unit tests from the APPS benchmark into Lean 4 theorem statements. A pipeline converts APPS coding interview questions and solutions into Lean 4 definitions with associated correctness theorems. The benchmark includes 4715 samples (1083 curated and quality-controlled), making it the largest formal verification benchmark for code generation. Models must produce both code and proofs of correctness.
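To make the idea of generalizing a unit test into a theorem concrete, the following Lean 4 sketch shows what a sample of this kind might look like. It is not taken from the benchmark: the problem (largest element of a list), the names `solve`, `solve_is_upper_bound`, and `solve_is_member`, and the use of `sorry` as the placeholder the model must replace are all assumptions made for illustration.

```lean
-- Hypothetical problem: return the largest element of a non-empty list.
-- Assumption: the definition body and the proofs are left as `sorry`,
-- and the model's task is to replace every `sorry`.
def solve (xs : List Nat) : Nat := sorry

-- A Python unit test such as `assert solve([1, 3, 2]) == 3` checks one input.
-- The two theorems below state the same intent for every input.

-- No element of the list exceeds the result.
theorem solve_is_upper_bound (xs : List Nat) (x : Nat) (h : x ∈ xs) :
    x ≤ solve xs := sorry

-- The result is itself an element of the (non-empty) list.
theorem solve_is_member (xs : List Nat) (h : xs ≠ []) :
    solve xs ∈ xs := sorry
```

Taken together, the two statements pin down the maximum for all lists, which is something a finite set of unit tests cannot do.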
Samples vary in the number of theorems they contain. In the baseline evaluation, Claude Sonnet solves 60 of 101 test samples and Gemini solves 43. As the table below shows, only 7 of Sonnet's 60 solved samples and 4 of Gemini's 43 are both guarded (non-trivially verified) and plausible, indicating that current models often prove only trivial properties. Most model successes fall into categories such as easy-branch-of-if, effectively-unit-test, or non-negativity-of-Nat.
| Category | Sonnet | Gemini |
|---|---|---|
| Unguarded | 41 | 28 |
| Guarded | 12 | 11 |
| Guarded and Plausible | 7 | 4 |
| Total | 60 | 43 |

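The success categories named above can be illustrated with a small Lean 4 sketch. The function `clampedSub` and all three theorems are hypothetical, and the mapping from each theorem to a category name is one plausible reading of those names, not a definition from the benchmark.

```lean
-- Hypothetical function, written out in full for illustration.
def clampedSub (a b : Nat) : Nat :=
  if a < b then 0 else a - b

-- non-negativity-of-Nat: holds for any Nat-valued function,
-- so the proof says nothing about `clampedSub` itself.
theorem clampedSub_nonneg (a b : Nat) : 0 ≤ clampedSub a b :=
  Nat.zero_le _

-- easy-branch-of-if: the hypothesis selects the branch that returns a constant.
theorem clampedSub_of_lt (a b : Nat) (h : a < b) : clampedSub a b = 0 := by
  simp [clampedSub, h]

-- effectively-unit-test: a single concrete input, checked by evaluation.
theorem clampedSub_example : clampedSub 5 3 = 2 := by
  decide
```

Each proof goes through, yet none constrains the behavior of `clampedSub` on the interesting branch, which is the sense in which a verified solution can still be trivial.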