Geoint-R1 generates Lean4 code for auxiliary geometric constructions, with a 1,885-problem benchmark of Lean4-annotated geometry problems.
Abstract
Mathematical geometric reasoning is essential for scientific discovery and educational development, requiring precise logic and rigorous formal verification. While recent advances in Multimodal Large Language Models (MLLMs) have improved reasoning tasks, existing models typically struggle with formal geometric reasoning, particularly when dynamically constructing and verifying auxiliary geometric elements. To address these challenges, we introduce Geoint-R1, a multimodal reasoning framework designed to generate formally verifiable geometric solutions from textual descriptions and visual diagrams. Geoint-R1 uniquely integrates auxiliary elements construction, formal reasoning represented via Lean4, and interactive visualization. To systematically evaluate and advance formal geometric reasoning, we propose the Geoint benchmark, comprising 1,885 rigorously annotated geometry problems across diverse topics such as plane, spatial, and solid geometry. Each problem includes structured textual annotations, precise Lean4 code for auxiliary constructions, and detailed solution steps verified by experts. Extensive experiments demonstrate that Geoint-R1 significantly surpasses existing multimodal and math-specific reasoning models, particularly on challenging problems requiring explicit auxiliary element constructions.
Problem
Multimodal LLMs struggle with formal geometric reasoning, especially dynamically constructing and verifying auxiliary geometric elements.
Approach
Geoint-R1 generates formally verifiable geometric solutions from text and diagrams, integrating auxiliary-element construction, formal reasoning represented in Lean4, and interactive visualization, trained via SFT then reinforcement learning with a verification reward. The Geoint benchmark provides 1,885 annotated problems with Lean4 code for auxiliary constructions.
Figure 2: The Geoint-R1 architecture and training pipeline.
Results
Geoint-R1 surpasses existing multimodal and math-specific reasoning models, particularly on problems requiring explicit auxiliary constructions; ablations show the verification reward, RL, and curriculum each contribute.
Figure 4: Accuracy (%) on auxiliary line problems.