← All papers

Agentic Proof Automation: A Case Study

Yichen Xu, Martin Odersky

cs.PL Jan 7, 2026 · v1

verification ai-agents type-theory

Read PDF arXiv abstract Code

TL;DR

Case study mechanizing System Capless type soundness in Lean 4 via agentic LLM proof automation.

Abstract

Proof engineering is notoriously labor-intensive: proofs that are straightforward on paper often require lengthy scripts in theorem provers. Recent advances in large language models (LLMs) create new opportunities for proof automation: modern LLMs not only generate proof scripts, but also support agentic behavior, exploring codebases and iteratively refining their outputs against prover feedback. These advances enable an emerging scheme where LLM-based agents undertake most proof engineering under human guidance. Humans provide mathematical insight (definitions, theorems, proof strategies); agents handle the mechanical work of proof development. We call this scheme agentic proof automation. We present this scheme through a case study: mechanizing the semantic type soundness of a sophisticated formal system, System Capless, in Lean 4, comprising over 14,000 lines of code. Using off-the-shelf LLM agents with a single lightweight proof-checking tool, the agents completed 189 proof engineering tasks with an 87% success rate, only 16% requiring human intervention. The case study demonstrates that agents are capable proof engineers that substantially boost productivity, though they fall short in creative reasoning and still require human guidance in certain cases. We release an interactive explorer where readers can examine all agent interactions; the mechanization is open-sourced for experiments and extensions.

Problem

Proof engineering is labor-intensive: proofs straightforward on paper often require lengthy scripts in theorem provers. Whether LLM-based agents can handle the mechanical work of proof development under human guidance is an open question.

Approach

The authors present agentic proof automation through a case study: mechanizing semantic type soundness of System Capless in Lean 4 (over 14,000 lines of code). Off-the-shelf LLM agents with a single lightweight proof-checking tool undertake proof engineering tasks while humans provide mathematical insight (definitions, theorems, proof strategies).

Results

The agents completed 189 proof engineering tasks with an 87% success rate; only 16% required human intervention. Tasks spanned proof (80% success), repair (90%), refactor (80%), state+prove (87%), query (100%), and chore (100%). Agents are capable proof engineers that boost productivity but fall short in creative reasoning.

Category	Total	Success	Intervention
Proof	51	41	20%
Repair	48	43	8%
Refactor	35	28	23%
State+Prove	23	20	35%
Query	21	21	0%
Total	189	164	16%

Task results by category

Papers With

Agentic Proof Automation: A Case Study