Formalizing Numerical Analysis: An Agent Pipeline and Quality Audit Beyond Kernel Acceptance

Theodore Meek, Siyuan Ge, Di Qiu Xiang, Simon Chess, Vasily Ilin

cs.AI Jun 12, 2026 · v1

autoformalization ai-agents meta-study analysis

Read PDF arXiv abstract

TL;DR

Uses a Claude-driven agent pipeline to formalize a numerical-analysis ODE textbook in Lean 4 and audits formalization faithfulness beyond compilation.

Abstract

Recent work has demonstrated that coding agents can formalize entire advanced mathematics textbooks in Lean 4, yet existing efforts concentrate on branches of mathematics already well-represented in mathlib and measure success solely through kernel acceptance. We address both limitations by applying a coding agent to formalize Numerical Methods for Ordinary Differential Equations, a textbook in numerical analysis that is largely absent from mathlib, stressing the agent's capacity to develop new theory from scratch. We further introduce a systematic, reproducible three-dimensional framework for evaluating the quality of agent-produced formalizations beyond compilation: semantic correctness, Mathlib reuse, and cross-file reuse via LLM-as-judge methods. Applying this framework to our own formalization and to the released outputs of RepoProver and M2F, we uncover recurring unfaithful formalization patterns, including incomplete multi-part statements, added weakening hypotheses, and parameter restrictions, that kernel acceptance entirely obscures. Our results suggest that compilation-based metrics substantially overstate formalization quality, and we provide a reproducible audit methodology to support more rigorous evaluation of future autoformalization systems.

Problem

Agentic autoformalization has scaled to whole textbooks, but prior work targets mathematics already well-covered by Mathlib and reports success only as kernel acceptance. It is unclear how these systems behave when theory must be built from scratch, or whether compilation reflects faithfulness to the source.

Approach

An autonomous loop driven by a Python orchestrator invokes four LLM roles (Planner, Worker, Evaluator, Consultant), implemented with Claude Opus 4.6 and Sonnet, to formalize Butcher's Numerical Methods for Ordinary Differential Equations in Lean 4, committing edits only when Lean builds. A reproducible three-dimensional LLM-as-judge framework then scores semantic correctness, Mathlib reuse, and cross-file reuse, applied to their own output and to the released RepoProver and M2F formalizations.

Results

The pipeline produced proof-complete Lean for 84 of 175 statements (48%, or 54.9% counting partials) with zero sorries and zero added axioms, yet 42% of its statements diverged semantically from their natural-language sources. RepoProver and M2F diverged 18% and 25% despite near-zero sorry counts, showing that compilation-based metrics substantially overstate formalization quality.

System	Sorries	Axioms
RepoProver	5*	0
M2F	12	4
OpenMath	0	0

Table 1: Open sorries and unnecessary axioms across systems (RepoProver sorries are intentional exercises).

System	Faithful	Stronger	Different
OpenMath	54%	4%	42%
M2F	73%	2%	25%
RepoProver	82%	1%	18%

Faithfulness of formal statements vs. natural-language sources (direct judgment).

Papers With

Formalizing Numerical Analysis: An Agent Pipeline and Quality Audit Beyond Kernel Acceptance