← All papers
First page of Lean-GAP: A Dataset of Formalized Graduate Algebra Problems

Lean-GAP: A Dataset of Formalized Graduate Algebra Problems

Seewoo Lee, Byung-Hak Hwang, Hyojae Lim, Jihoon Hyun, Ilkyoo Choi, Yeachan Park, Jineon Baek, Hyukpyo Hong, Keewoo Lee, Jaeseong Heo, Hyungryul Baik, Chul-hee Lee, Kyu-Hwan Lee

cs.LO May 20, 2026 · v1
Builds Lean-GAP, 430 graduate algebra problems autoformalized into Lean 4 with a verification pipeline.
We present Lean-GAP (Lean-Graduate Agebra Problems), 430 formalized graduate-level algebra problems from the textbook Abstract Algebra by Dummit and Foote. We develop a scalable pipeline consisting of PDF-to-LaTeX preprocessing, autoformalization into Lean 4, and verification of informal-formal correspondence. While the preprocessing and autoformalization stages can be largely automated, we find that verification remains the most subtle and labor-intensive component, requiring careful human oversight. Our contributions include (i) the construction of a structured dataset of formalized exercises, (ii) a systematic methodology for formalizing textbook mathematics, and (iii) an analysis of recurring challenges in the formalization process. We also compare the performance of different autoformalization models and highlight key bottlenecks in translating informal statements into formal language.

Formalized mathematics in proof assistants remains limited relative to the vast informal corpus. In particular, standard graduate-level textbook material that forms the foundation of mathematical training is underrepresented, leaving a gap for training and evaluating automated reasoning systems.

The authors present Lean-GAP, 430 formalized graduate-level algebra problems from Dummit and Foote's Abstract Algebra textbook. Their pipeline consists of PDF-to-LaTeX preprocessing via OCR, autoformalization into Lean 4 using multiple LLMs (GPT-5, Gemma4-31B, Goedel-Formalizer-V2-32B, Qwen3.6, DeepSeek-R1, and Codex in an agent loop), and human verification of informal-formal correspondence. They also develop suspicious-statement pattern detectors and LLM-based semantic checking.

GPT-5 achieves 44.1% any-pass elaboration rate among single-shot systems, while Codex in an agent loop reaches 95.5%. Semantic faithfulness evaluation shows Codex achieves 51.2% faithful formalizations versus 45.1% for GPT-5. The verification stage remains the most labor-intensive component.

Modelany_passall_passfailmissing
GPT-5867 (44.1%)577942157
Goedel-Formalizer-V2-32B711 (36.2%)65812532
Codex (agent loop)1877 (95.5%)1857890
Elaboration results by model