Builds 18k instruction-response pairs across five formal languages including Lean4, evaluating and fine-tuning LLMs on verifiable formal proofs.
Abstract
The research in AI-based formal mathematical reasoning has shown an unstoppable growth trend. These studies have excelled in mathematical competitions like IMO and have made significant progress. This paper focuses on formal verification, an immediate application scenario of formal reasoning, and breaks it down into sub-tasks. We constructed 18k high-quality instruction-response pairs across five formal specification languages (Coq, Lean4, Dafny, ACSL, and TLA+) by distilling gpt-4o and evaluated against ten open-sourced LLMs, including recent popular DeepSeek-R1. We also fine-tuned several 7~8B small models to achieve comparable performance with Deepseek-R1-671B. Interestingly, we observed that fine-tuning with formal data also enhances mathematics, reasoning, and coding capabilities. Fine-tuned models are released at https: //huggingface.co/fm-universe.
Problem
LLM capability on formal verification, as opposed to competition math, is under-evaluated across the range of formal specification languages and sub-tasks.
Approach
The authors construct 18k instruction-response pairs across five formal specification languages (Coq, Lean4, Dafny, ACSL, TLA+) by distilling GPT-4o, break formal verification into six sub-tasks, evaluate ten open-source LLMs, and fine-tune several 7-8B models.
Figure 2: Six tasks towards Informal to Formal Verification
Results
Fine-tuned 7-8B models reach performance comparable to DeepSeek-R1-671B, and fine-tuning on formal data also improves mathematics, reasoning, and coding; the Lean4 split contains 1,163 proofs and 1,578 segments.
Spec
Proofs
Segments
Coq
2,126
14,939
Lean4
1,163
1,578
ACSL
544
765
TLA+
256
594
Dafny
249
417
Formal-language proof/segment counts in the dataset