Aria: An Agent For Retrieval and Iterative Auto-Formalization via Dependency Graph

Hanyu Wang, Ruohan Xie, Yutong Wang, Guoxiong Gao, Xintao Yu, Bin Dong

cs.AI Oct 6, 2025 · v1

autoformalization ai-agents mathlib theorem-proving-ml


TL;DR

Aria is an LLM agent for conjecture-level autoformalization in Lean. It decomposes a statement into a dependency graph of concepts and checks semantic correctness with AriaScorer, which grounds terms against Mathlib definitions.

Abstract

Accurate auto-formalization of theorem statements is essential for advancing automated discovery and verification of research-level mathematics, yet remains a major bottleneck for LLMs due to hallucinations, semantic mismatches, and their inability to synthesize new definitions. To tackle these issues, we present Aria (Agent for Retrieval and Iterative Autoformalization), a system for conjecture-level formalization in Lean that emulates human expert reasoning via a two-phase Graph-of-Thought process: recursively decomposing statements into a dependency graph and then constructing formalizations from grounded concepts. To ensure semantic correctness, we introduce AriaScorer, a checker that retrieves definitions from Mathlib for term-level grounding, enabling rigorous and reliable verification. We evaluate Aria on diverse benchmarks. On ProofNet, it achieves 91.6% compilation success rate and 68.5% final accuracy, surpassing previous methods. On FATE-X, a suite of challenging algebra problems from research literature, it outperforms the best baseline with 44.0% vs. 24.0% final accuracy. On a dataset of homological conjectures, Aria reaches 42.9% final accuracy while all other models score 0%.

Problem

Accurate auto-formalization of theorem statements is essential for advancing automated mathematical verification, yet when formalizing research-level mathematics LLMs suffer from hallucinations, semantic mismatches, and an inability to synthesize new definitions.

Approach

Aria (Agent for Retrieval and Iterative Autoformalization) performs conjecture-level formalization in Lean via a two-phase Graph-of-Thought process: recursively decomposing statements into a dependency graph and then constructing formalizations from grounded concepts. AriaScorer, a semantic checker, retrieves definitions from Mathlib for term-level grounding to ensure correctness.
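The two-phase process described above can be illustrated with a minimal sketch. This is not Aria's implementation: the concept names, the `KNOWN_DEFINITIONS` lookup (a toy stand-in for Mathlib retrieval that does not reflect Mathlib's actual coverage), and the `DECOMPOSITION` table (a stand-in for the LLM decomposition step) are all assumptions made for illustration. Phase 1 recursively expands a statement into a dependency graph until every leaf is grounded; phase 2 visits the graph in dependency order so that each ungrounded concept is built only from concepts that already have a formalization.

```python
from dataclasses import dataclass, field

# Hypothetical stand-in for Mathlib retrieval: concept name -> Lean identifier.
# A toy table for illustration only; it does not reflect Mathlib's real contents.
KNOWN_DEFINITIONS = {
    "ring": "Ring",
    "module": "Module",
    "ideal": "Ideal",
}

# Hypothetical stand-in for the LLM decomposition step: concept -> sub-concepts.
DECOMPOSITION = {
    "statement": ["derived notion", "ideal"],
    "derived notion": ["module", "ring"],
}


@dataclass
class Node:
    name: str
    grounded: bool                      # True if found in KNOWN_DEFINITIONS
    deps: list = field(default_factory=list)


def decompose(concept, graph):
    """Phase 1: recursively expand a concept into a dependency graph."""
    if concept in graph:                # already expanded
        return graph
    grounded = concept in KNOWN_DEFINITIONS
    # Grounded concepts are leaves; others are split into sub-concepts.
    deps = [] if grounded else DECOMPOSITION.get(concept, [])
    graph[concept] = Node(concept, grounded, deps)
    for dep in deps:
        decompose(dep, graph)
    return graph


def build_order(graph, root):
    """Post-order depth-first traversal: dependencies come before dependents."""
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in graph[name].deps:
            visit(dep)
        order.append(name)

    visit(root)
    return order


def construct(graph, root):
    """Phase 2: plan formalizations bottom-up from grounded concepts."""
    plan = []
    for name in build_order(graph, root):
        node = graph[name]
        if node.grounded:
            plan.append((name, "reuse " + KNOWN_DEFINITIONS[name]))
        else:
            plan.append((name, "synthesize from " + ", ".join(node.deps)))
    return plan


if __name__ == "__main__":
    graph = decompose("statement", {})
    for name, action in construct(graph, "statement"):
        print(f"{name}: {action}")
```

In this sketch the post-order traversal guarantees that a concept is only synthesized after everything it depends on has been handled, which is the property the dependency graph is meant to provide. A real system would replace both lookup tables with retrieval and model calls and would check each synthesized definition by compiling it in Lean.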

Results

On ProofNet, Aria achieves a 91.6% compilation success rate and 68.5% final accuracy, surpassing previous methods. On FATE-X (challenging algebra problems from research literature), it reaches 44.0% final accuracy vs. 24.0% for the best baseline. On a dataset of homological conjectures, Aria achieves 42.9% final accuracy while all other models score 0%.
