Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving

Chuxue Cao, Mengze Li, Juntao Dai, Jinluan Yang, Zijian Zhao, Shengyu Zhang, Weijie Shi, Chengzhong Liu, Sirui Han, Yike Guo

cs.AI Jun 20, 2025 · v1

theorem-proving-ml dataset logic

Read PDF arXiv abstract

TL;DR

Releases a curated dataset of 447 mathematical theorems in Lean 4 format for evaluating LLM first-order-logic theorem proving.

Abstract

Large language models (LLMs) have shown promising first-order logic (FOL) reasoning capabilities with applications in various areas. However, their effectiveness in complex mathematical reasoning involving multi-step FOL deductions is still under-researched. While LLMs perform competitively on established mathematical reasoning benchmarks, they struggle with multi-step FOL tasks, as demonstrated by Deepseek-Prover-V2-7B's low accuracy (4.2%) on our proposed theorem proving dataset. This issue arises from the limited exploration of diverse proof strategies and the potential for early reasoning mistakes to undermine entire proofs. To address these issues, we propose DREAM, a self-adaptive solution that enhances the Diversity and REAsonability of LLMs' generation strategies. DREAM incorporates an Axiom-Driven Strategy Diversification mechanism to promote varied strategic outcomes and a Sub-Proposition Error Feedback to help LLMs reflect on and correct their proofs. Our contributions include pioneering advancements in LLMs' mathematical reasoning through FOL theorem proving, introducing a novel inference stage solution that improves performance by 0.6% to 6.4%, and providing a curated dataset of 447 mathematical theorems in Lean 4 format for evaluation.

Problem

LLMs show promising first-order logic reasoning capabilities but struggle with multi-step FOL theorem proving. Deepseek-Prover-V2-7B achieves only 4.2% accuracy on a proposed FOL theorem proving dataset. Limited exploration of diverse proof strategies and early mistakes undermine entire proofs.

Approach

DREAM (Diversity and REAsonability) is a self-adaptive inference-stage solution. It uses Axiom-Driven Strategy Diversification to promote varied proof strategies by constructing combinatorial axiom sets, and Sub-Proposition Error Feedback to help LLMs reflect on and correct intermediate steps. A curated dataset of 447 mathematical theorems in Lean 4 format is provided for evaluation.

Results

DREAM improves performance by 0.6% to 6.4% across models. On the proposed FOL-MATH dataset, Claude 3.5 with DREAM achieves 10.1% average accuracy (vs 0.2% for repeated sampling), and DeepSeek-Prover-V2-7B with DREAM achieves 8.3% (vs 4.2% baseline).

Model + Method	Avg.
Claude 3.5 + Repeated	0.2%
Claude 3.5 + Subgoal	3.7%
Claude 3.5 + DREAM	10.1%
DSP-V2-7B + Repeated	4.2%
DSP-V2-7B + DREAM	8.3%

Results on FOL-MATH dataset (selected)

Papers With

Towards Advanced Mathematical Reasoning for LLMs via First-Order Logic Theorem Proving