Simplifying Formal Proof-Generating Models with ChatGPT and Basic Searching Techniques
Sangjun Han, Taeil Hur, Youngmi Hur, Kathy Sangkyung Lee, Myungyoon Lee, Hyojae Lim
cs.LO
Feb 5, 2025 · v1
TL;DR
Combines ChatGPT with the Lean prover and simple search to generate verifiable formal proofs on the miniF2F dataset.
Abstract
The challenge of formal proof generation has a rich history, but with modern techniques, we may finally be at the stage of making actual progress in real-life mathematical problems. This paper explores the integration of ChatGPT and basic searching techniques to simplify generating formal proofs, with a particular focus on the miniF2F dataset. We demonstrate how combining a large language model like ChatGPT with a formal language such as Lean, which has the added advantage of being verifiable, enhances the efficiency and accessibility of formal proof generation. Despite its simplicity, our best-performing Lean-based model surpasses all known benchmarks with a 31.15% pass rate. We extend our experiments to include other datasets and employ alternative language models, showcasing our models' comparable performance in diverse settings and allowing for a more nuanced analysis of our results. Our findings offer insights into AI-assisted formal proof generation, suggesting a promising direction for future research in formal mathematical proof.
Problem
Generating formal proofs in Lean automatically remains difficult, and it is unclear how effectively a general-purpose LLM like ChatGPT can be combined with simple search techniques to produce verified proofs.
Approach
The authors integrate ChatGPT with basic searching techniques (including feedback-based algorithms) for generating Lean proofs on the miniF2F dataset. They use Lean's verifier as an oracle to confirm correctness, iterating with different feedback strategies (including a Bad(O)-based feedback algorithm that communicates error information back to the LLM).
Results
The best-performing Lean-based model achieves a 31.15% pass rate on miniF2F, surpassing all previously known benchmarks at the time. The approach demonstrates comparable performance across different datasets and alternative language models.