Lean-GAP: A Dataset of Formalized Graduate Algebra Problems

Seewoo Lee, Byung-Hak Hwang, Hyojae Lim, Jihoon Hyun, Ilkyoo Choi, Yeachan Park, Jineon Baek, Hyukpyo Hong, Keewoo Lee, Jaeseong Heo, Hyungryul Baik, Chul-hee Lee, Kyu-Hwan Lee

cs.LO May 20, 2026 · v1

dataset autoformalization algebra

Read PDF arXiv abstract Code

TL;DR

Builds Lean-GAP, 430 graduate algebra problems autoformalized into Lean 4 with a verification pipeline.

Abstract

We present Lean-GAP (Lean-Graduate Agebra Problems), 430 formalized graduate-level algebra problems from the textbook Abstract Algebra by Dummit and Foote. We develop a scalable pipeline consisting of PDF-to-LaTeX preprocessing, autoformalization into Lean 4, and verification of informal-formal correspondence. While the preprocessing and autoformalization stages can be largely automated, we find that verification remains the most subtle and labor-intensive component, requiring careful human oversight. Our contributions include (i) the construction of a structured dataset of formalized exercises, (ii) a systematic methodology for formalizing textbook mathematics, and (iii) an analysis of recurring challenges in the formalization process. We also compare the performance of different autoformalization models and highlight key bottlenecks in translating informal statements into formal language.

Problem

Formalized mathematics in proof assistants remains limited relative to the vast informal corpus. In particular, standard graduate-level textbook material that forms the foundation of mathematical training is underrepresented, leaving a gap for training and evaluating automated reasoning systems.

Approach

The authors present Lean-GAP, 430 formalized graduate-level algebra problems from Dummit and Foote's Abstract Algebra textbook. Their pipeline consists of PDF-to-LaTeX preprocessing via OCR, autoformalization into Lean 4 using multiple LLMs (GPT-5, Gemma4-31B, Goedel-Formalizer-V2-32B, Qwen3.6, DeepSeek-R1, and Codex in an agent loop), and human verification of informal-formal correspondence. They also develop suspicious-statement pattern detectors and LLM-based semantic checking.

Results

GPT-5 achieves 44.1% any-pass elaboration rate among single-shot systems, while Codex in an agent loop reaches 95.5%. Semantic faithfulness evaluation shows Codex achieves 51.2% faithful formalizations versus 45.1% for GPT-5. The verification stage remains the most labor-intensive component.

Model	any_pass	all_pass	fail	missing
GPT-5	867 (44.1%)	577	942	157
Goedel-Formalizer-V2-32B	711 (36.2%)	658	1253	2
Codex (agent loop)	1877 (95.5%)	1857	89	0

Elaboration results by model

Papers With

Lean-GAP: A Dataset of Formalized Graduate Algebra Problems