← All papers

LemmaHead: RAG Assisted Proof Generation Using Large Language Models

Tianbo Yang, Mingqi Yan, Hongyi Zhao, Tianshuo Yang

cs.LG Jan 27, 2025 · v1

theorem-proving-ml autoformalization

Read PDF arXiv abstract

TL;DR

Tests RAG-assisted LLM proof generation by producing proofs to mathematical claims in the Lean formal language.

Abstract

Developing the logic necessary to solve mathematical problems or write mathematical proofs is one of the more difficult objectives for large language models (LLMS). Currently, the most popular methods in literature consists of fine-tuning the model on written mathematical content such as academic publications and textbooks, so that the model can learn to emulate the style of mathematical writing. In this project, we explore the effectiveness of using retrieval augmented generation (RAG) to address gaps in the mathematical reasoning of LLMs. We develop LemmaHead, a RAG knowledge base that supplements queries to the model with relevant mathematical context, with particular focus on context from published textbooks. To measure our model's performance in mathematical reasoning, our testing paradigm focuses on the task of automated theorem proving via generating proofs to a given mathematical claim in the Lean formal language.

Problem

LLMs struggle with the logic necessary for mathematical proof generation, and fine-tuning alone may not address fundamental gaps in mathematical reasoning.

Approach

The authors develop LemmaHead, a retrieval augmented generation (RAG) knowledge base that supplements LLM queries with relevant mathematical context from published textbooks. They explore three RAG pipeline variants: basic RAG, enhanced query generation (EQG) that reformulates queries for better retrieval, and iterative proof augmentation (IPA) that iteratively refines proofs using retrieved context. Performance is measured on Lean formal proof generation.

Results

RAG-assisted GPT-4 with IPA achieves 40.0% proof generation success compared to 9.4% for baseline GPT-4 and 23.9% for GPT-f. Enhanced query generation alone reaches 25.2%.

Method	Success Rate
GPT-4	9.4%
RAG-assisted GPT-4	2.3%
RAG-assisted GPT-4 with EQG	25.2%
RAG-assisted GPT-4 with IPA	40.0%
Human-guided GPT-4	11.5%
GPT-f	23.9%

Proof generation success rates

Papers With

LemmaHead: RAG Assisted Proof Generation Using Large Language Models