Theorem Prover as a Judge for Synthetic Data Generation

Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen

cs.AI Feb 18, 2025 · v1

autoformalization theorem-proving-ml dataset

Read PDF arXiv abstract Code

TL;DR

Uses the Lean prover to verify autoformalized LLM reasoning steps, judging synthetic data and providing reinforcement-learning feedback.

Abstract

The demand for synthetic data in mathematical reasoning has increased due to its potential to enhance the mathematical capabilities of large language models (LLMs). However, ensuring the validity of intermediate reasoning steps remains a significant challenge, affecting data quality. While formal verification via theorem provers effectively validates LLM reasoning, the autoformalisation of mathematical proofs remains error-prone. In response, we introduce iterative autoformalisation, an approach that iteratively refines theorem prover formalisation to mitigate errors, thereby increasing the execution rate on the Lean prover from 60% to 87%. Building upon that, we introduce Theorem Prover as a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to rigorously assess LLM intermediate reasoning, effectively integrating autoformalisation with synthetic data generation. Finally, we present Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that replaces human annotation with theorem prover feedback in Reinforcement Learning from Human Feedback (RLHF). Across multiple LLMs, applying TP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA.

Problem

Ensuring validity of intermediate reasoning steps in LLM-generated mathematical solutions is difficult, affecting synthetic training data quality. Autoformalisation of mathematical proofs into theorem provers remains error-prone, with only about 60% execution rate on the Lean prover.

Approach

Iterative autoformalisation refines theorem prover formalisation over multiple rounds to mitigate errors. TP-as-a-Judge uses theorem prover formalisation to assess LLM intermediate reasoning and filter synthetic data. RLTPF (Reinforcement Learning from Theorem Prover Feedback) replaces human annotation with theorem prover verification signals in the RLHF framework, using SFT and DPO on verified/false reasoning chains.

Results

Iterative autoformalisation increases Lean execution rate from 60% to 87%. With only 3,508 training samples, the method achieves 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA.

Papers With

Theorem Prover as a Judge for Synthetic Data Generation