Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities

Haoyu Zhao, Yihan Geng, Shange Tang, Yong Lin, Bohan Lyu, Hongzhou Lin, Chi Jin, Sanjeev Arora

cs.AI May 19, 2025 · v1

benchmark theorem-proving-ml dataset algebra

Read PDF arXiv abstract Code

TL;DR

Introduces Ineq-Comp, a Lean benchmark of transformed inequalities that evaluates LLM-based Lean provers' compositional reasoning.

Abstract

LLM-based formal proof assistants (e.g., in Lean) hold great promise for automating mathematical discovery. But beyond syntactic correctness, do these systems truly understand mathematical structure as humans do? We investigate this question in context of mathematical inequalities -- specifically the prover's ability to recognize that the given problem simplifies by applying a known inequality such as AM/GM. Specifically, we are interested in their ability to do this in a compositional setting where multiple inequalities must be applied as part of a solution. We introduce Ineq-Comp, a benchmark built from elementary inequalities through systematic transformations, including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, we find that most provers -- including Goedel, STP, and Kimina-7B -- struggle significantly. DeepSeek-Prover-V2-7B shows relative robustness, but still suffers a 20% performance drop (pass@32). Even for DeepSeek-Prover-V2-671B model, the gap between compositional variants and seed problems exists, implying that simply scaling up the model size alone does not fully solve the compositional weakness. Strikingly, performance remains poor for all models even when formal proofs of the constituent parts are provided in context, revealing that the source of weakness is indeed in compositional reasoning. Our results expose a persisting gap between the generalization behavior of current AI provers and human mathematical intuition. All data and evaluation code can be found at https://github.com/haoyuzhao123/LeanIneqComp.

Problem

LLM-based formal provers in Lean can produce syntactically correct proofs, but it is unclear whether they understand mathematical structure in a human-intuitive way, particularly the ability to compositionally apply known inequalities like AM/GM.

Approach

Ineq-Comp is a benchmark built from elementary inequalities through systematic transformations including variable duplication, algebraic rewriting, and multi-step composition. Although these problems remain easy for humans, the benchmark tests whether provers can recognize compositional applications of known inequalities. The authors also test whether providing formal proofs of constituent parts in context helps.

Results

Most provers struggle significantly on compositional variants. DeepSeek-Prover-V2-7B shows relative robustness but still suffers a 20% drop (pass@32). Even DeepSeek-Prover-V2-671B shows a gap between compositional variants and seed problems. Performance remains poor even when formal proofs of constituent parts are provided in context, confirming the weakness is in compositional reasoning itself.

Papers With

Ineq-Comp: Benchmarking Human-Intuitive Compositional Reasoning in Automated Theorem Proving on Inequalities