From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs

Jialun Cao, Yaojie Lu, Meiziniu Li, Haoyang Ma, Haokun Li, Mengda He, Cheng Wen, Le Sun, Hongyu Zhang, Shengchao Qin, Shing-Chi Cheung, Cong Tian

cs.AI Jan 27, 2025 · v4 cs.CL cs.PL

benchmark autoformalization

Read PDF arXiv abstract Code

TL;DR

Builds 18k instruction-response pairs across five formal languages including Lean4, evaluating and fine-tuning LLMs on verifiable formal proofs.

Abstract

The research in AI-based formal mathematical reasoning has shown an unstoppable growth trend. These studies have excelled in mathematical competitions like IMO and have made significant progress. This paper focuses on formal verification, an immediate application scenario of formal reasoning, and breaks it down into sub-tasks. We constructed 18k high-quality instruction-response pairs across five formal specification languages (Coq, Lean4, Dafny, ACSL, and TLA+) by distilling gpt-4o and evaluated against ten open-sourced LLMs, including recent popular DeepSeek-R1. We also fine-tuned several 7~8B small models to achieve comparable performance with Deepseek-R1-671B. Interestingly, we observed that fine-tuning with formal data also enhances mathematics, reasoning, and coding capabilities. Fine-tuned models are released at https: //huggingface.co/fm-universe.

Problem

LLM capability on formal verification, as opposed to competition math, is under-evaluated across the range of formal specification languages and sub-tasks.

Approach

The authors construct 18k instruction-response pairs across five formal specification languages (Coq, Lean4, Dafny, ACSL, TLA+) by distilling GPT-4o, break formal verification into six sub-tasks, evaluate ten open-source LLMs, and fine-tune several 7-8B models.

Figure 2: Six tasks towards Informal to Formal Verification

Results

Fine-tuned 7-8B models reach performance comparable to DeepSeek-R1-671B, and fine-tuning on formal data also improves mathematics, reasoning, and coding; the Lean4 split contains 1,163 proofs and 1,578 segments.

Spec	Proofs	Segments
Coq	2,126	14,939
Lean4	1,163	1,578
ACSL	544	765
TLA+	256	594
Dafny	249	417

Formal-language proof/segment counts in the dataset

Papers With

From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs