TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition

Yupei Li, Philipp Borchert, Gerasimos Lampouras

cs.CL Oct 13, 2025 · v1

autoformalization theorem-proving-ml dataset mathlib

Read PDF arXiv abstract Code

TL;DR

Aligns code repositories to Lean 4 formal statements to train Math LLMs for autoformalization, evaluated against Mathlib-derived data.

Abstract

Large Language Models (LLMs) excel at both informal and formal (e.g. Lean 4) mathematical reasoning but still struggle with autoformalisation, the task of transforming informal into formal mathematical statements. Autoformalisation helps pair the informal reasoning of LLMs with formal proof assistants which enable machine-verifiable generation and mitigate hallucinations. Yet, the performance of current Math LLMs is constrained by the scarcity of large-scale corpora, particularly those containing pairs of informal and formal statements. Although current models are trained to generate code from natural language instructions, structural and syntactic differences between these and formal mathematics limit effective transfer learning. We propose TopoAlign, a framework that unlocks widely available code repositories as training resources for Math LLMs. TopoAlign decomposes code into docstrings, main functions, and dependency functions, and reassembles these components into analogues that structurally mirror formal statements. This produces structurally aligned code data that can be used for training Math LLMs without requiring additional human annotation. We train two state-of-the-art models, DeepSeek-Math and Herald, and evaluate them on the minif2f, Putnam, and ProofNet benchmarks. TopoAlign provides substantial gains for DeepSeek-Math, improving performance by 17.77% on BEq@10 and 68.82% on typecheck@10. Despite introducing no new mathematical knowledge, our framework achieves gains of 0.12% and 1.09% for Herald on BEq@10 and typecheck@10, respectively, demonstrating that training on aligned code data is beneficial even for specialized models.

Problem

Autoformalisation (transforming informal mathematics into formal Lean 4 statements) is bottlenecked by the scarcity of paired informal-formal training data, and structural differences between code and formal math limit transfer learning from code corpora.

Approach

TopoAlign decomposes code into docstrings, main functions, and dependency functions, then reassembles these components into analogues that structurally mirror formal mathematical statements (with docstrings mapping to informal statements, main functions to formal statements, and dependencies to supporting lemmas). This produces structurally aligned training data from widely available code repositories without additional human annotation.

Results

Training DeepSeek-Math with TopoAlign improves performance by 17.77% on BEq@10 and 68.82% on typecheck@10. For the specialized Herald model, gains are 0.12% and 1.09% respectively on the same metrics. Evaluation is on miniF2F, Putnam, and ProofNet benchmarks.

Papers With

TopoAlign: A Framework for Aligning Code to Math via Topological Decomposition