← All papers
First page of From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs

From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs

Jialun Cao, Yaojie Lu, Meiziniu Li, Haoyang Ma, Haokun Li, Mengda He, Cheng Wen, Le Sun, Hongyu Zhang, Shengchao Qin, Shing-Chi Cheung, Cong Tian

cs.AI Jan 27, 2025 · v4 cs.CL cs.PL
Builds 18k instruction-response pairs across five formal languages including Lean4, evaluating and fine-tuning LLMs on verifiable formal proofs.
The research in AI-based formal mathematical reasoning has shown an unstoppable growth trend. These studies have excelled in mathematical competitions like IMO and have made significant progress. This paper focuses on formal verification, an immediate application scenario of formal reasoning, and breaks it down into sub-tasks. We constructed 18k high-quality instruction-response pairs across five formal specification languages (Coq, Lean4, Dafny, ACSL, and TLA+) by distilling gpt-4o and evaluated against ten open-sourced LLMs, including recent popular DeepSeek-R1. We also fine-tuned several 7~8B small models to achieve comparable performance with Deepseek-R1-671B. Interestingly, we observed that fine-tuning with formal data also enhances mathematics, reasoning, and coding capabilities. Fine-tuned models are released at https: //huggingface.co/fm-universe.

LLM capability on formal verification, as opposed to competition math, is under-evaluated across the range of formal specification languages and sub-tasks.

The authors construct 18k instruction-response pairs across five formal specification languages (Coq, Lean4, Dafny, ACSL, TLA+) by distilling GPT-4o, break formal verification into six sub-tasks, evaluate ten open-source LLMs, and fine-tune several 7-8B models.

Figure 2: Six tasks towards Informal to Formal Verification

Fine-tuned 7-8B models reach performance comparable to DeepSeek-R1-671B, and fine-tuning on formal data also improves mathematics, reasoning, and coding; the Lean4 split contains 1,163 proofs and 1,578 segments.

SpecProofsSegments
Coq2,12614,939
Lean41,1631,578
ACSL544765
TLA+256594
Dafny249417
Formal-language proof/segment counts in the dataset