Lean-GAP: A Dataset of Formalized Graduate Algebra Problems
Formalized mathematics in proof assistants remains limited relative to the vast informal corpus. In particular, standard graduate-level textbook material that forms the foundation of mathematical training is underrepresented, leaving a gap for training and evaluating automated reasoning systems.
The authors present Lean-GAP, 430 formalized graduate-level algebra problems from Dummit and Foote's Abstract Algebra textbook. Their pipeline consists of PDF-to-LaTeX preprocessing via OCR, autoformalization into Lean 4 using multiple LLMs (GPT-5, Gemma4-31B, Goedel-Formalizer-V2-32B, Qwen3.6, DeepSeek-R1, and Codex in an agent loop), and human verification of informal-formal correspondence. They also develop suspicious-statement pattern detectors and LLM-based semantic checking.
GPT-5 achieves 44.1% any-pass elaboration rate among single-shot systems, while Codex in an agent loop reaches 95.5%. Semantic faithfulness evaluation shows Codex achieves 51.2% faithful formalizations versus 45.1% for GPT-5. The verification stage remains the most labor-intensive component.
| Model | any_pass | all_pass | fail | missing |
|---|---|---|---|---|
| GPT-5 | 867 (44.1%) | 577 | 942 | 157 |
| Goedel-Formalizer-V2-32B | 711 (36.2%) | 658 | 1253 | 2 |
| Codex (agent loop) | 1877 (95.5%) | 1857 | 89 | 0 |
