IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch

Param Biyani, Shashank Kirtania, Yasharth Bajpai, Sumit Gulwani, Ashish Tiwari

cs.AI Nov 30, 2025 · v1

benchmark autoformalization dataset theorem-proving-ml

Read PDF arXiv abstract Code

TL;DR

Introduces IndiMathBench, 312 human-verified Lean 4 theorems from Indian Math Olympiads, built via compiler-feedback autoformalization.

Abstract

Reliable autoformalization remains challenging even in the era of large language models (LLMs). The scarcity of high-quality training data is a major bottleneck. Expert annotation requires substantial time and deep expertise in both mathematics and theorem proving. We introduce IndiMathBench, a human-verified benchmark designed to evaluate mathematical theorem proving, curated using an AI-powered human-assisted pipeline for formalizing natural language problems in Lean. IndiMathBench is composed of 312 formal Lean 4 theorems paired with their corresponding informal problem statements, sourced from Indian Mathematics Olympiads. Through category-based retrieval, iterative compiler feedback, and multi-model ensembles, our pipeline generates candidate formalizations that experts efficiently validate via an interactive dashboard with automated quality summaries. Evaluation across multiple frontier models demonstrates that autoformalization remains challenging, with substantial gaps between syntactic validity and semantic correctness, while theorem proving success rates remain low even with iterative refinement, demonstrating that \benchmark~presents a challenging testbed for mathematical reasoning. IndiMathBench is available at https://github.com/prmbiy/IndiMathBench.

Problem

Reliable autoformalization remains challenging due to scarcity of high-quality paired informal-formal training data. Expert annotation requires deep expertise in both mathematics and theorem proving, and existing benchmarks lack diversity in cultural scope and mathematical coverage.

Approach

IndiMathBench uses an AI-powered human-assisted pipeline: category-based retrieval, iterative compiler feedback, and multi-model ensembles generate candidate formalizations in Lean 4, which experts validate via an interactive dashboard with automated quality summaries. The benchmark draws from Indian Mathematics Olympiads (RMO and INMO) covering geometry, algebra, number theory, and combinatorics.

Results

The benchmark contains 312 formal Lean 4 theorems paired with informal statements. Evaluation shows substantial gaps between syntactic validity and semantic correctness across frontier models: Claude Opus 4 achieves 54 BEq-passing problems, compilation pass rates range from 38.5% (GPT-4.1) to 77.9% (Claude Opus 4) with documentation and feedback. Theorem proving success rates remain low even with iterative refinement.

Model	BEq (312)	Compile Rate (Doc+FB)
Claude Opus 4	54	77.9%
GPT-5	38	75.3%
Claude Sonnet 4	51	68.6%
o3 (high)	32	65.7%
Gemini 2.5 Pro	47	48.4%

Model performance on IndiMathBench

Papers With

IndiMathBench: Autoformalizing Mathematical Reasoning Problems with a Human Touch