Mathematics with large language models as provers and verifiers
Hieu Le Duc, Leo Liberti
cs.CL
Oct 11, 2025 · v1
TL;DR
Final LLM-generated proofs are formally verified by the Lean proof assistant to guard against hallucinated reasoning.
Abstract
During 2024 and 2025 the discussion about the theorem-proving capabilities of large language models started reporting interesting success stories, mostly to do with difficult exercises (such as problems from the International Mathematical Olympiad), but also with conjectures [Feldman & Karbasi, arXiv:2509.18383v1] formulated for the purpose of verifying whether the artificial intelligence could prove it. In this paper we report a theorem proving feat achieved by ChatGPT by using a protocol involving different prover and verifier instances of the gpt-5 model working collaboratively. To make sure that the produced proofs do not suffer from hallucinations, the final proof is formally verified by the lean proof assistant, and the conformance of premises and conclusion of the lean code is verified by a human. Our methodology is by no means complete or exact. It was nonetheless able to solve five out of six 2025 IMO problems, and close about a third of the sixty-six number theory conjectures in [Cohen, Journal of Integer Sequences, 2025].
Problem
Whether large language models can reliably prove mathematical theorems without hallucination, and whether a prover-verifier protocol can scale to competition-level problems and open conjectures.
Approach
The paper uses a Test-Time Verify-Revise (TTVR) protocol with two gpt-5 instances (a prover and a verifier) interacting iteratively. The prover generates candidate proofs, the verifier checks them and provides feedback, and the loop continues until the verifier accepts. Final proofs are formally verified in Lean, with a human confirming the Lean code matches the claimed statement.
Results
The protocol solved 5 out of 6 2025 IMO problems and closed about a third (20 out of 66) of open number theory conjectures from Cohen (2025). Proofs are Lean-certified where possible.
| Problem | 1 | 2 | 3 | 4 | 5 | 6 |
|---|
| Iterations | 9 | 7 | 3 | 6 | 5 | NA |
| Correct? | Y | Y | Y | Y | Y | NA |
2025 IMO results: iterations and correctness