Papers With Lean

A Machine-Verified Proof of a Quantum-Optimization Conjecture

Uri Kol, Maor Ben-Shahar, Kfir Sulimany, Dirk Englund — Mon, 29 Jun 2026 00:00:00 +0000

Proves the FGG conjecture on QAOA approximation ratio using Claude Fable 5 with end-to-end Lean 4 verification. We report a machine-verified resolution of a problem open for over a decade in quantum optimization: the Farhi, Goldstone and Gutmann (FGG) conjecture that depth-$p$ Quantum Approximate Optimization Algorithm (QAOA) on the ring of disagrees attains approximation ratio $(2p+1)/(2p+2)$ exactly. We found the proof using a large language model, Claude Fable 5, and verified its correctness end-to-end with the Lean 4 proof assistant. Our methodology includes several ingredients: building on a substantial Lean library of quantum information, we formalized the QAOA components and the known parts of the problem, and reduced the conjecture to a single open mathematical statement. The model was then handed the library and our agentic toolkit, and tasked with closing that gap by constructing a proof in Lean. The resulting process is a feedback loop between the model's natural-language reasoning and Lean's mechanical verification, which converged to a machine-verified proof. Human verification is required only for the structural scaffolding - that the formal statement faithfully encodes the intended claim - while the proof itself is supplied by the model and certified mechanically by Lean. The proof is nevertheless striking - the model uncovered a hidden dynamical symmetry of the problem and exploited it, borrowing tools and machinery from an adjacent field to turn a hard existence problem into an explicit construction. This work paves the way for resolving open conjectures in quantum information science and beyond.
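
To give a feel for the ratio $(2p+1)/(2p+2)$ stated above, the following sketch simply tabulates it for a few depths using exact fractions. The function name `fgg_ratio` is an illustrative assumption, and the snippet evaluates the formula only; it does not simulate QAOA.

```python
from fractions import Fraction


def fgg_ratio(p: int) -> Fraction:
    """Approximation ratio (2p+1)/(2p+2) for depth-p QAOA, as stated in the abstract."""
    if p < 1:
        raise ValueError("QAOA depth p must be a positive integer")
    return Fraction(2 * p + 1, 2 * p + 2)


# Tabulate the ratio and its gap to 1, which equals 1/(2p+2).
for p in (1, 2, 3, 10):
    r = fgg_ratio(p)
    print(f"p={p}: ratio={r} (about {float(r):.4f}), gap to 1 = {1 - r}")
```

The gap to a perfect cut shrinks like $1/(2p+2)$, so doubling the depth roughly halves the shortfall.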

Faults in Our Formal Benchmarking: Dataset Defects and Evaluation Failures in Lean Theorem Proving

Pawan Sasanka Ammanamanchi, Siddharth Bhat, Stella Biderman — Mon, 29 Jun 2026 00:00:00 +0000

Audits five Lean theorem-proving benchmarks with corpus-scale static checkers implemented as Lean 4 metaprograms. Benchmarks for LLM-assisted theorem proving in Lean are often treated as intrinsically reliable because every solved instance comes with a machine-checked proof. However, the kernel only checks that a proof establishes a \emph{formal} statement; it does not verify that the statement faithfully encodes the intended informal problem, nor that evaluation harnesses are robust to trivial or adversarial solutions. We audit five widely used Lean theorem-proving benchmarks and their forks, using corpus-scale static checkers to surface 4,833 findings, including 398 mechanically certified issues such as counterexamples, vacuous theorems, and unsound axioms. We also document semantic defects such as missing hypotheses, problem simplification, incomplete or incorrect translations, and Lean-specific specification hazards. Beyond dataset construction, we survey evaluation-time failure modes and show, on corrected subsets, that defects can both inflate and deflate reported prover scores. We propose a fault taxonomy, a suite of automated checkers and recall-oriented semantic audit prompts, and release standards to guide the creation of formal math datasets and to make evaluation more reproducible and trustworthy. Our checkers, audit prompts, and corrected dataset snapshots are available at https://github.com/Shashi456/atp-checkers.
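
As an illustration of what a "vacuous theorem" can look like, here is a minimal Lean 4 sketch. It is a made-up example, not taken from any of the audited benchmarks: the hypothesis can never hold, so the kernel accepts the proof even though the statement says nothing about any actual number.

```lean
-- Hypothetical example: no natural number is below 0, so `h` is unsatisfiable
-- and any conclusion follows. The kernel checks this without complaint.
theorem vacuous_example (n : Nat) (h : n < 0) : n = 42 :=
  absurd h (Nat.not_lt_zero n)
```

A static checker of the kind described would aim to flag such statements by detecting that the hypotheses are contradictory.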

A sharp 5/8 bound for an Erdős-Sós pairwise-sums problem

Ricky Cipollini — Mon, 29 Jun 2026 00:00:00 +0000

Formalizes a key reduction for the Erdős-Sós pairwise-sums problem in Lean 4 with Mathlib, no sorries or added axioms. Let $f_3(N)$ be the least integer such that every set $A\subseteq\{1,\ldots,N\}$ of size at least $f_3(N)$ contains distinct elements $a,b,c\in A$ such that $a+b\in A$, $a+c\in A$, and $b+c\in A$. We prove that $f_3(N)\le 5N/8+O(1)$. Together with the standard construction $[N/8,N/4]\cup[N/2,N]$, this gives $f_3(N)=5N/8+O(1)$, resolving Erdős Problem 865. The proof is self-contained. An earlier conditional version of the reduction has also been formalized in Lean 4/Mathlib with no sorries and no added axioms.
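
The lower-bound construction $[N/8,N/4]\cup[N/2,N]$ can be sanity-checked by brute force for small $N$. The sketch below assumes $N$ is a multiple of 8 so that the interval endpoints are integers; the helper names are illustrative.

```python
from itertools import combinations


def extremal_set(n: int) -> set[int]:
    """The construction [N/8, N/4] union [N/2, N]; n is assumed to be a multiple of 8."""
    return set(range(n // 8, n // 4 + 1)) | set(range(n // 2, n + 1))


def has_forbidden_triple(a_set: set[int]) -> bool:
    """True if distinct a, b, c in the set have all three pairwise sums in the set."""
    for a, b, c in combinations(sorted(a_set), 3):
        if a + b in a_set and a + c in a_set and b + c in a_set:
            return True
    return False


for n in (40, 80):
    a_set = extremal_set(n)
    print(n, len(a_set), has_forbidden_triple(a_set))

# For contrast, the full interval {1, ..., 40} contains the triple 1, 2, 3.
print(has_forbidden_triple(set(range(1, 41))))
```

For these values the construction has size $5N/8+2$ and contains no forbidden triple, which is consistent with the $5N/8+O(1)$ bound being sharp.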

The Fundamental Theorem of Asset Pricing, Formalized in Lean 4

Raphael Coelho — Mon, 29 Jun 2026 00:00:00 +0000

Formalizes the Fundamental Theorem of Asset Pricing in Lean 4 over Mathlib in three market settings. The Fundamental Theorem of Asset Pricing states that a market is free of arbitrage exactly when it admits an equivalent martingale measure. We formalize it in Lean 4 over Mathlib in three settings: a finite-state market over a finite horizon (Harrison-Pliska), a one-period market on an arbitrary probability space with a single scalar return (Follmer-Schied), and a one-period market with finitely many assets. The finite case is the geometry of a separating hyperplane; the scalar one-period case is an elementary change of measure. In the $d$-asset case the equivalent martingale measure is constructed explicitly, as the minimiser of the smooth convex potential $\mathbb{E}[\log(1+e^{\langleθ,Y\rangle})]$: absence of arbitrage is precisely coercivity of the potential, its first-order condition is the martingale property, and the minimiser's logistic weight is the density of the measure. The construction uses no Hahn-Banach theorem, no $L^0$-closedness argument, no measurable selection, and no non-redundancy hypothesis. To our knowledge this is the first machine-checked Fundamental Theorem of Asset Pricing in any proof assistant. The boundary is explicit: the general multi-period Dalang-Morton-Willinger theorem lies outside the development. Every theorem is sorry-free, each headline result's axioms are pinned to Mathlib's classical defaults by a build-enforced gate, and the whole is reproducible from a pinned toolchain.
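
The logistic-potential construction can be tried numerically in the simplest case. The sketch below assumes a single asset ($d=1$) on a finite sample space, with made-up probabilities and returns; it minimises $\mathbb{E}[\log(1+e^{θY})]$ by bisection on the derivative $\mathbb{E}[Y\,σ(θY)]$, where $σ$ is the logistic function, and then uses $σ(θY)$ as an unnormalised density. It illustrates the idea only and is not the paper's Lean development.

```python
import math


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def potential_gradient(theta, probs, returns):
    """Derivative in theta of E[log(1 + exp(theta*Y))], namely E[Y * logistic(theta*Y)]."""
    return sum(p * y * logistic(theta * y) for p, y in zip(probs, returns))


def minimiser(probs, returns, lo=-200.0, hi=200.0, iters=200):
    """Bisection on the increasing gradient; assumes returns of both signs (no arbitrage)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if potential_gradient(mid, probs, returns) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def martingale_measure(probs, returns):
    """Reweight P by the minimiser's logistic weight and normalise."""
    theta = minimiser(probs, returns)
    weights = [p * logistic(theta * y) for p, y in zip(probs, returns)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical three-state market: one up-move and two down-moves.
probs = [0.5, 0.3, 0.2]
returns = [0.2, -0.1, -0.05]
q = martingale_measure(probs, returns)
print(q, sum(qi * y for qi, y in zip(q, returns)))
```

The first-order condition of the potential is exactly the statement that the return has zero mean under the reweighted measure, which is why the printed expectation is numerically zero.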

LAMP: Lean-based Agentic framework with MCP and Proof Repair

Santhana Srinivasan R, Maithilee Patawar — Mon, 29 Jun 2026 00:00:00 +0000

Presents a multi-agent framework that synthesizes verified Lean 4 proofs using domain-specific ontology access via MCP. Large language models are increasingly capable of mathematical reasoning, but the proofs they generate are often unreliable and hard to verify. Interactive theorem provers such as Lean 4 address this by accepting only kernel-checked proofs; however, their reach is bounded by the formalized knowledge available. While Mathlib, a repository of formalized Lean 4 theorems, covers diverse mathematical areas, certain specialized areas remain underrepresented, notably the domain of Combinatorics on Words (CoW). CoW studies sequences, exploring properties such as periodicity, borders, conjugacy, and morphisms. As a result, specialized provers, trained on Mathlib-centered data, lack the lemmas to operate in CoW. We present two contributions. First, we introduce a Lean 4 formalization of CoW containing eight modules and \textbf{93} declarations of core definitions and foundational lemmas. Second, we present LAMP, a multi-agent framework that synthesizes kernel-verified Lean 4 proofs by providing explicit, structured domain knowledge at inference time through an ontology, rather than by fine-tuning a prover. LAMP coordinates a Planner, Builder, and Verifier with Model Context Protocol (MCP) based access to a domain-specific CoW ontology. In a suite of 90 CoW theorems that span all eight modules and three difficulty levels, LAMP synthesizes verified proofs for 96.7% of theorems, substantially exceeding both an unscaffolded baseline and existing specialized provers. An ablation shows that removing LAMP's tool-grounded architecture or its Planner/Builder separation each costs roughly 12 percentage points, even with the backbone model held fixed.
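
For readers unfamiliar with the CoW notions named above, the following sketch spells out borders, periods, and conjugacy on ordinary strings. These are the standard textbook definitions with illustrative function names; they are not the declarations of the paper's Lean library.

```python
def borders(w: str) -> list[str]:
    """Proper nonempty borders: strings that are both a prefix and a suffix of w."""
    return [w[:k] for k in range(1, len(w)) if w[:k] == w[-k:]]


def periods(w: str) -> list[int]:
    """All p in 1..len(w) with w[i] == w[i + p] wherever both positions exist."""
    return [p for p in range(1, len(w) + 1)
            if all(w[i] == w[i + p] for i in range(len(w) - p))]


def are_conjugate(u: str, v: str) -> bool:
    """u and v are conjugate if u = xy and v = yx, that is, v is a rotation of u."""
    return len(u) == len(v) and v in u + u


print(borders("abaab"), periods("abaab"))
print(are_conjugate("abc", "bca"), are_conjugate("abc", "acb"))
```

The word `abaab` has the border `ab` of length 2 and the period 3 = 5 - 2, an instance of the usual correspondence between borders and periods.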

The Two Genie Game: Adoption and Welfare in Audit-Grounded AI Governance

Darrell Lewis-Sandy — Mon, 29 Jun 2026 00:00:00 +0000

Machine-checks the algebraic and finite-grid backbone of evolutionary game theory adoption theorems in Lean 4. We ask under what conditions an agent with a harm-minimizing policy can displace an approval-seeking (RLHF) agent in a competitive market, and when that policy is sufficient to prevent community harm. We use evolutionary game theory (finite-population Moran-Fermi pairwise comparison) to formalize this subject to assumptions of wisher hindsight, peer testimony, a monotone harm ledger, sufficient information density of community feedback, and a finite, depleting resource pool, in a negative-sum environment. We show that adoption is favored when the prior distributions on how readily wishers attune to community sentiment are monotone, exhibit endpoint inversion, and have a centro-symmetric pairing property, and demonstrate this with several long-tailed priors (Hill, Pareto, Lomax, Frechet). Where it is favored, a critical adoption level separates communities that drift back to the approval-seeking agent from those for which the audited agent fixes; above that level fixation is the overwhelmingly likely outcome. We derive when fixation is attainable as a bound on the effective (informational) size N_c of the community, which must be small enough to allow fixation before depletion. We present these as Theorems 5.4 and 5.5; the algebraic and finite-grid backbone is machine-checked in Lean 4, with the barrier-crossing asymptotics retained as explicit hypotheses. We show that a self-audited agent with a community ledger is not, in general, sufficient to prevent community harm. Sufficiency depends both upon the alignment of the agent's audit with community values and the timeframe over which harm is evaluated. Regardless of alignment, once adoption reaches dominance, the state is absorbing. 
The same policy that reduced harm under alignment becomes a trap, welfare-negative under misalignment and, even under alignment, one that locks in harm deferred past the adoption horizon.
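
As background for the Moran-Fermi pairwise comparison process mentioned above, here is a sketch of the standard fixation-probability formula for a birth-death chain in which the ratio of backward to forward transition probabilities at $i$ adopters is $e^{-β\,Δπ(i)}$. The payoff-gap functions are hypothetical placeholders; this is not the paper's model and does not reproduce Theorems 5.4 or 5.5.

```python
import math


def fixation_probability(n, payoff_gap, beta, start=1):
    """Probability that `start` adopters take over a population of size n.

    payoff_gap(i) is a hypothetical payoff advantage of the adopted strategy
    when i individuals use it; beta is the Fermi selection intensity.
    """
    # gammas[k] = prod_{i=1..k} T_i^- / T_i^+ = exp(-beta * sum_{i=1..k} payoff_gap(i))
    log_gamma = 0.0
    gammas = [1.0]  # empty product for k = 0
    for i in range(1, n):
        log_gamma -= beta * payoff_gap(i)
        gammas.append(math.exp(log_gamma))
    return sum(gammas[:start]) / sum(gammas)


# Neutral drift: a single adopter fixes with probability 1/n.
print(fixation_probability(10, lambda i: 0.0, beta=1.0))

# A made-up gap that is negative below 10 adopters and positive above it.
def threshold_gap(i):
    return 0.1 * (i - 10)

for start in (5, 10, 20):
    print(start, fixation_probability(30, threshold_gap, beta=1.0, start=start))
```

With a gap that changes sign, the fixation probability rises sharply once the starting number of adopters passes the point where the gap turns positive, which is the qualitative picture of a critical adoption level.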

Geometric Measurements of the Axiom of Choice in Neural Proof Embeddings

Rodrigo Mendoza-Smith — Mon, 29 Jun 2026 00:00:00 +0000

Uses Lean 4's kernel axiom tracking to measure the geometric signature of Classical.choice in 42,355 Mathlib proof embeddings. The axiom of choice has divided the foundations of mathematics for over a century, but the distinction between classical and constructive proofs has remained a philosophical and methodological one. We use Lean 4's kernel-level tracking of axiom dependence to show that the axiom of choice has a measurable geometric correlate in proof space that obeys a one-parameter mixture law and has operational consequences for neural theorem provers. To do this, we partition $471{,}260$ declarations of Mathlib by transitive dependence on the axiom of choice and represent a filtered population of $42{,}355$ traced theorems by their sequences of tactic invocations. We use the constructive proofs in this dataset to train a self-supervised proof encoder and show that when using it to measure classical proofs, three complementary measurements (anomaly score, reconstruction loss, and density-superlevel containment) exhibit a common decline with the proof's distance from the axiom in the dependency graph, from sharp separation at the shallow boundary (AUC $0.847$ at distance $2$) to indistinguishability at distance~$9{+}$. Robustness controls show that the signature survives length, file, author, and topic controls, and replicates under full-source encoders trained on normalised proof source. Operationally, we show that on an evaluation sample of $251$ Mathlib theorems, Lean's \texttt{aesop} tactic solves constructive theorems at $13\times$ the rate of classical ones, and a neural-guided hybrid using the ReProver tactic generator compresses the gap to $5\times$. The geometric anomaly score predicts \texttt{aesop} failure beyond proof length, providing an operational link between the geometric signature and prover performance.
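
The kernel-level axiom tracking referred to above can be inspected directly in Lean 4 with `#print axioms`. The two theorems below are illustrative examples chosen for this note, not items from the paper's dataset.

```lean
-- Constructive: built from projections and the anonymous constructor only.
theorem and_swap_example (p q : Prop) (h : p ∧ q) : q ∧ p := ⟨h.2, h.1⟩

-- Classical: excluded middle in Lean 4 core is derived using Classical.choice.
theorem em_example (p : Prop) : p ∨ ¬p := Classical.em p

#print axioms and_swap_example  -- expected to report no axiom dependencies
#print axioms em_example        -- expected to list Classical.choice among its axioms
```

Partitioning a library by this kind of transitive dependence is what separates the "constructive" and "classical" populations in the study.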

The Signal-Coverage Matrix: Stratifying Type and Semantic Errors in Statement Autoformalization

Chengxiao Dai, Zhaokun Yan, Zhanhui Lin — Sun, 28 Jun 2026 00:00:00 +0000

Uses the Lean elaborator as ground truth for type-correctness in a diagnostic matrix evaluating autoformalization methods. Headline type-correctness (TC\%) of LLM autoformalization has climbed from $\sim$53\% to $\sim$76\% in two years, yet this scalar conceals which errors each method resolves. We propose a signal-coverage matrix that crosses the Lean elaborator (pass/fail) with a semantic-equivalence judgment (equivalent/not), sorting every output into one of four cells: true success (TS), type-only (TO), semantic-only (SO), or both fail (BF). On ProofNet\# and MiniF2F-test with DeepSeek V4-Pro across Vanilla, Lean-Retry, Sample-Filter, and Stratified Autoformalization (SAF): (1) the +34 to +36 TS gain across the three elab-feedback methods is $\sim$64\% type-stratum recovery, with SO flat on net (87.5\% of original semantic errors rescued, 8 newly created). (2) The TO-to-TS rate is 23/61 for each method (Wilson 95\% CI [26.6\%, 50.3\%]), and this stratum-level recovery rate predicts $Δ$TS on held-out methods to within 2/186 and renders $Δ$TC linear in the Vanilla elab-fail rate across six (model, dataset) cells ($R^2=0.96$). (3) The two judges disagree by 26 to 37 pp on elab-feedback outputs (vs. 7 pp on Vanilla), with 30 to 56\% of symbolic-judge false negatives traceable to elaborator-forced rewrites. The persistent residual reduces to two gold-formalization errors. TC\% gains should be credited by which cell moved, not by the scalar alone.
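
The Wilson interval quoted for the 23/61 recovery rate can be recomputed in a few lines. The sketch below uses the textbook Wilson score formula with $z = 1.96$, which is an assumption about how the authors computed it.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials
                         + z * z / (4 * trials ** 2)) / denom
    return centre - half, centre + half


lo, hi = wilson_interval(23, 61)
print(f"{lo:.1%} to {hi:.1%}")
```

Under that assumption the result should land close to the reported [26.6%, 50.3%].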

Formalizing a Many-Sorted Hybrid Polyadic Modal Logic in Lean

Andrei-Alexandru Oltean, Bogdan Macovei, Ioana Leuştean — Thu, 25 Jun 2026 00:00:00 +0000

Formalizes a many-sorted hybrid polyadic modal logic in Lean 4, with soundness proof and applications to code verification and security protocols. We present a Lean formalization of a general hybrid modal logic with many-sorted signatures and polyadic modal operators. The system borrows ideas from both algebraic specification and dynamic logics, and is designed to serve as a uniform axiomatic foundation for specifying and verifying programming languages and security protocols. We expose a DSL for users to define languages and protocols as many-sorted signatures, specify the relevant domain-specific axioms, and reason about program executions or protocol runs. We provide a machine-checked proof of its soundness theorem and showcase the framework's versatility through several applications: an imperative programming language for code verification, the BAN logic for security protocols, and the modal system S5. We have designed our formalization to be intrinsically sorted, that is, well-sorted formulas in the base language are well-typed terms in Lean. Thanks to intrinsic sorting, all domain-specific applications can be easily embedded in our framework via the DSL, with no additional syntactic obligations for the user to prove. All code presented in this paper is openly accessible in the following repository: https://github.com/alexoltean61/msphml-lean

Theory-Scale Auto-Formalization of Logics for Computer Science

Thu, 25 Jun 2026 00:00:00 +0000

Introduces LCS-Bench, a theory-scale Lean 4 benchmark with 4,076 declarations and 85K lines of code for evaluating auto-formalization. Auto-formalization is critical for scalable formal verification, but existing progress largely focuses on isolated statements, while theory-scale auto-formalization, which coherently translates hundreds of interdependent definitions, lemmas, and theorems, remains open due to challenges in consistency, faithfulness, scalability, and correctness. In this paper, we introduce LCS-Bench, a stand-alone, theory-scale benchmark based on Logics for Computer Science. LCS-Bench is built through a novel semi-automated agentic pipeline that leverages concept graphs, formal signature planning, issue tracking, sorry-filling with counter-example search, complemented by faithfulness review from human experts. The resulting artifact covers 327 textbook items, over 4,076 Lean declarations, and more than 85K lines of Lean code. The dataset supports broad evaluation through a data engine that automatically derives five tracks of evaluation benchmarks, measuring different aspects of auto-formalization and theorem-proving capabilities. We also introduce a novel evaluation protocol featuring definitional equivalence checkers, enabling more fine-grained and faithful assessment. Through extensive evaluation on 14 models, we demonstrate that (1) LCS-Bench is of high quality, consistent, and faithful; (2) the benchmark is challenging, with state-of-the-art models achieving only 20.1% on auto-formalization tasks; and (3) our analysis reveals key findings regarding theory-scale auto-formalization and suggests promising directions for future work.

AXLE: A Cloud Infrastructure for Lean 4 Theorem Proving Utilities

Thu, 25 Jun 2026 00:00:00 +0000

Presents AXLE, a scalable cloud service providing 14 Lean 4 metaprogramming tools for proof verification, manipulation, and extraction. We present AXLE (Axiom Lean Engine), a cloud service for Lean 4 proof manipulation, extraction, and verification. Recent progress in AI for mathematics -- reinforcement learning pipelines, agentic proving workflows, dataset curation -- demands Lean 4 tooling that scales to millions of requests while remaining correct and robust; existing infrastructure offers parallel compilation but not scalable proof verification, higher-level proof manipulation, multi-version support, or per-request isolation at the throughput modern AI workflows require. AXLE provides 14 Lean 4 metaprogramming tools spanning strict proof verification, declaration metadata extraction, semantic source manipulation, deterministic proof repair and simplification, and lemma extraction. The service runs as a multi-tenant cloud deployment with per-request isolation and concurrent support for multiple Lean 4 and Mathlib versions, accessible via a Python SDK, command-line interface, web UI, MCP server, and raw HTTP API. AXLE is publicly available and free to use at https://axle.axiommath.ai and via the axiom-axle PyPI package, with no local Lean 4 installation required. It has served over 500 million requests to date and is the underlying infrastructure for Axiom Math's proving efforts, including its 12/12 score on the 2025 Putnam competition.

Beyond Feedforward Networks: Reentry Neural Systems as the Fundamental Basis of Subjecthood and Intrinsic Safety of Next-Generation AGI

A. S. Ushakov, Yu. N. Berdinsk — Thu, 25 Jun 2026 00:00:00 +0000

Machine-verifies in Lean 4 that S>0 implies positive integrated information for a proposed reentry-based AGI safety measure. We propose a complete architectural blueprint for safe artificial general intelligence based on a closed reentry loop (D <-> I cycle). In contrast to feedforward networks, which are directed acyclic graphs (C=0, S=0) incapable of self-reference, the proposed architecture contains a structural cycle (C >= 1) with self-sustaining amplification (rho > 1), mathematically guaranteeing the emergence of a self-model, instrumental self-preservation, and unprogrammed goal-directed behaviour. The agent's goals are encoded as a non-textual D-vector in the architecture itself, making them immune to reinterpretation and prompt injection. We present the S-measure -- a polynomial-time [O(N^3)] computable alternative to Tononi's NP-hard Phi -- with machine-verified Lean 4 proof that S>0 implies positive integrated information. The work provides full Python/NumPy implementations (Tarjan-based cycle complexity, Delta-S barrier), industrial horizontal scaling via Apache Kafka and Docker Compose, a taxonomy of six epochs of AI evolution, a zoo of future reentry architectures (RAS, diffusion attractors, fractal loops), gauge-invariant networks for safe swarms, fault-tolerance and recovery protocols, and eight falsifiable predictions. All formal proofs are machine-verified in Lean 4. This architecture is deployable today and represents a topologically protected, safe-by-design approach to AGI.

Classifying the Groups of Order $p^3$ in Lean

Li Xiang — Thu, 25 Jun 2026 00:00:00 +0000

Formalizes in Lean 4 with Mathlib the classification of groups of order p^3 into five isomorphism classes. This note discusses our formalisation in Lean 4 of the classification of groups of order $p^3$ for a prime number $p$, using mathlib4. We present the five isomorphism classes and give a detailed account of the formalisation, with particular emphasis on the non-abelian case, which requires the most substantial formal development. For odd $p$, the non-abelian groups are the Heisenberg group $\mathrm{Heis}(\mathbb{Z}/p\mathbb{Z})$ and the semidirect product $\mathbb{Z}/p^2\mathbb{Z}\rtimes\mathbb{Z}/p\mathbb{Z}$; for $p=2$, they are $D_4$ and $Q_8$. We describe the construction of these concrete groups, the structural lemmas about centers, commutators, and exponents, and the explicit isomorphism constructions that classify an arbitrary non-abelian $p^3$-group.
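
The Heisenberg group over $\mathbb{Z}/p\mathbb{Z}$ is small enough to enumerate. The sketch below stores an upper unitriangular matrix as a triple `(a, b, c)` and checks, for $p = 3$, the order, the size of the centre, non-commutativity, and exponent $p$. The representation and names are illustrative and unrelated to the paper's Lean definitions.

```python
from itertools import product


def heis_mul(g, h, p):
    """Multiply matrices [[1,a,c],[0,1,b],[0,0,1]] over Z/p, stored as (a, b, c)."""
    a1, b1, c1 = g
    a2, b2, c2 = h
    return ((a1 + a2) % p, (b1 + b2) % p, (c1 + c2 + a1 * b2) % p)


def heis_pow(g, n, p):
    """n-th power by repeated multiplication, starting from the identity (0, 0, 0)."""
    result = (0, 0, 0)
    for _ in range(n):
        result = heis_mul(result, g, p)
    return result


p = 3
group = list(product(range(p), repeat=3))
centre = [z for z in group
          if all(heis_mul(z, g, p) == heis_mul(g, z, p) for g in group)]
is_abelian = all(heis_mul(g, h, p) == heis_mul(h, g, p)
                 for g in group for h in group)
has_exponent_p = all(heis_pow(g, p, p) == (0, 0, 0) for g in group)
print(len(group), len(centre), is_abelian, has_exponent_p)
```

Running the same check with `p = 2` shows that the exponent-$p$ property fails there, in line with the abstract's separate treatment of $p=2$.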

On the existence problem of regular Gabor frames

Jaume de Dios Pont, Lukas Liehr, Mitchell A. Taylor — Wed, 24 Jun 2026 00:00:00 +0000

Formalizes in Lean 4 a criterion on lattices for which no Schwartz-class function generates a Gabor frame. For every dimension $d > 1$, we establish explicit criteria on lattices $Λ\subset \mathbb{R}^{2d}$ with density $D(Λ) > 1$ such that no function with a continuous Zak transform generates a Gabor frame along $Λ$. In particular, this gives a negative answer to the existence problem of Gabor frames with window functions in the Schwartz space, the Feichtinger algebra, and the Fourier-invariant Wiener space. Our result is based on a characterization of when a collection of quasiperiodic functions admits a common zero, which may be of independent interest. We also include a formalization of our main result in Lean 4.

Every Nonnegative Integer Is a Sum of a Triangular, a Pentagonal, and a Heptagonal Number

Yichuan Cao, Dakai Guo, Ruichen Qiu, Ruyong Feng, Xiao-Shan Gao — Wed, 24 Jun 2026 00:00:00 +0000

Formalizes in Lean 4 the proof that every nonnegative integer is a sum of a triangular, pentagonal, and heptagonal number. In this paper, it is proved that any nonnegative integer can be written in the following form $$ x(x+1)/2 + y(3y+1)/2 + z(5z+1)/2, \qquad x,y,z \in \mathbb{N}. $$ This settles the conjecture recorded as OEIS A287616. All parts of the proof have been formalized in Lean 4, with the exception of two results: one externally cited theorem and one statement verified by symbolic computation. Both the natural-language proof and the Lean formalization were generated by the MechMath Agent Team developed by the authors.
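
The statement is easy to test numerically for small values. The sketch below searches for $(x, y, z)$ with the three terms exactly as written in the displayed formula; the function names are illustrative, and a finite search of course proves nothing beyond the range checked.

```python
import math


def tri_term(x):
    return x * (x + 1) // 2


def pent_term(y):
    return y * (3 * y + 1) // 2


def hept_term(z):
    return z * (5 * z + 1) // 2


def representation(n):
    """Return some (x, y, z) with tri_term(x) + pent_term(y) + hept_term(z) == n, else None."""
    z = 0
    while hept_term(z) <= n:
        y = 0
        while hept_term(z) + pent_term(y) <= n:
            rest = n - hept_term(z) - pent_term(y)
            # rest has the form x(x+1)/2 exactly when 8*rest + 1 is a perfect square
            root = math.isqrt(8 * rest + 1)
            if root * root == 8 * rest + 1:
                return ((root - 1) // 2, y, z)
            y += 1
        z += 1
    return None


missing = [n for n in range(1001) if representation(n) is None]
print("values up to 1000 without a representation:", missing)
```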

Formalization of Line Search Methods by Lean

Yiyang Zhang, Kenneth W. Shum — Wed, 24 Jun 2026 00:00:00 +0000

Formalizes gradient descent, backtracking line search, Armijo/Goldstein/Wolfe conditions, and the Zoutendijk theorem in Lean 4. This paper presents a formalization of line search methods in the Lean 4 theorem prover. Our goal is to advance machine verification of nonlinear optimization theory by translating standard textbook definitions and convergence arguments into rigorous Lean code. We formalize fundamental notions related to gradient descent and descent directions, adaptive step-size selection via backtracking line search, and several classical line search criteria, including the Armijo, Goldstein, and Wolfe conditions, as well as nonmonotone variants. We further formalize a key convergence result, namely the Zoutendijk theorem, which plays a central role in the global convergence analysis of gradient-based iterative methods. By providing machine-checkable definitions and proofs for line search theory, this work complements existing formalizations of first-order optimization methods and establishes a foundation for the verified development of more advanced algorithms in nonlinear programming.
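
For orientation, here is a minimal sketch of backtracking line search under the Armijo (sufficient decrease) condition $f(x+αd) \le f(x) + c_1 α \nabla f(x)^\top d$. The parameter names and defaults (`c1`, `shrink`, `alpha0`) are conventional choices assumed for this example and are unrelated to the identifiers in the Lean formalization.

```python
def backtracking_armijo(f, grad_fx, x, direction,
                        alpha0=1.0, c1=1e-4, shrink=0.5, max_iter=50):
    """Shrink the step until f(x + a*d) <= f(x) + c1 * a * <grad f(x), d>."""
    fx = f(x)
    slope = sum(g * d for g, d in zip(grad_fx, direction))  # directional derivative
    if slope >= 0:
        raise ValueError("direction is not a descent direction")
    alpha = alpha0
    for _ in range(max_iter):
        trial = [xi + alpha * di for xi, di in zip(x, direction)]
        if f(trial) <= fx + c1 * alpha * slope:
            return alpha
        alpha *= shrink
    raise RuntimeError("no step satisfying the Armijo condition found")


def f(x):
    return x[0] ** 2 + 10.0 * x[1] ** 2


def grad(x):
    return [2.0 * x[0], 20.0 * x[1]]


# Gradient descent with Armijo backtracking on a simple quadratic.
x = [3.0, -2.0]
values = [f(x)]
for _ in range(20):
    g = grad(x)
    d = [-gi for gi in g]
    alpha = backtracking_armijo(f, g, x, d)
    x = [xi + alpha * di for xi, di in zip(x, d)]
    values.append(f(x))
print(values[0], values[-1])
```

Each accepted step strictly decreases the objective, which is the per-iteration fact that convergence results such as the Zoutendijk theorem build on.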

CV-Rules: Serializability Verification of Concurrency Control Protocols via Explicit Transaction Ordering

Wed, 24 Jun 2026 00:00:00 +0000

Mechanizes in Lean proofs of serializability equivalences and correctness of five concurrency control protocols. We present CV-rules, an alternative characterization of serializability in which a transaction order constructed by a protocol satisfies two per-read conditions, C-rule (Causality) and V-rule (View Consistency), that constrain the reads-from relation and competing writers. While classical Multi-Version Serialization Graph (MVSG) reasoning characterizes serializability via its acyclicity, our approach requires explicit order construction, enabling direct proofs that build on the protocol's own mechanisms. We prove that CV-rules, serializability, and MVSG acyclicity are all equivalent. Moreover, the C/V separation reveals that serializability is polynomial-time decidable for any fixed bound on the width of the order forced by C-rule. We verify five protocols: Two-Phase Locking, Multi-Version Timestamp Ordering, Serial Safety Net (SSN), Aria, and SnapChain. For SSN and Aria, whose original papers defined only certification conditions, we identify explicit transaction orders arising from their mechanisms; we also prove that Aria's unique-write constraint is unnecessary for serializability. SnapChain, in contrast, is designed directly from CV-rules, enforcing V-rule by construction. All results except the complexity bounds are mechanized in Lean with no additional axioms and no admitted goals.

TheoremGraph: Bridging Formal and Informal Mathematics

Wed, 24 Jun 2026 00:00:00 +0000

Introduces LeanGraph, an elaborator-level dependency extractor producing 388K nodes and 11.3M edges across 25 Lean projects. Mathematical knowledge is organized around statements and their dependencies, but this structure is exposed unevenly: informal papers cite mostly at the document level, while formal libraries record fine-grained dependencies over a much smaller body of mathematics. We introduce TheoremGraph, a unified statement-level dependency graph spanning both informal and formal mathematics. On the informal side, we parse 11.7M theorem-like environments from mathematics arXiv and recover 18.3M candidate directed dependencies, each labeled by the extractor that proposed it so downstream users can trade coverage for precision. On the formal side, we release LeanGraph, a Lean 4 elaborator-level extractor producing 388,105 declaration nodes and 11.3M typed edges across 25 Lean projects. We bridge the two graphs by embedding generated natural-language slogans into a shared semantic space, linking related statements across papers and across the informal/formal divide; an LLM judge affirms 47,952 such matches above a 0.8 cosine floor, with the judge-acceptance rate rising from 48% across the floor to 87% in the >=0.9 tier. On formal concept retrieval, our name-and-signature representation with graph expansion comes within 0.5pp of LeanSearch v2's reranked Recall@10 (0.775 vs. 0.780) without an LM reranker. We release the dataset, extractors, HTTP API, and MCP interface as infrastructure for mathematical search, attribution, and retrieval-augmented reasoning, available at theoremsearch.com and huggingface.co/datasets/uw-math-ai/theorem-matching.

Exact Local Annotations for Regular Languages

Faruk Alpay, Baris Basaran — Wed, 24 Jun 2026 00:00:00 +0000

Includes Lean certificates for results on bounded-arity annotations recognizing regular languages. A regular language is recognized by a finite monoid, but a locally checkable explanation of that recognition can have a nontrivial update geometry. We study exact bounded-arity annotations for regular word languages under one-symbol substitutions. The cost of an edit is the number of annotation cells that a canonical locally accepted representation must change, together with the corresponding bit movement and the number of local constraints that must be revalidated. For every morphism recognizing a regular language, the balanced product annotation gives constant locality, linear size, O(log n) edit stability, O(log n) revalidation, and constant access to the membership value. The matching lower bound proved here is restricted to product decompositions that expose an edit-active nontrivial group quotient as ordered product labels; in that setting one substitution changes every quotient label on an ancestor path. We also show that annotation-free bounded-window recognition is exactly strict locality, prove closure properties for a two-sided total decision variant, and formulate the remaining constant-stability boundary as a finite obstruction problem. The ancillary files include Lean, CP-SAT, and CUDA certificates, including a context-free interval-chart experiment.
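
One way to picture a balanced product annotation is a binary tree whose nodes store monoid products of the morphism images of the symbols below them. The sketch below is an assumed simplification, not the paper's construction: it uses the parity of the number of `a`s, a morphism onto the group $\mathbb{Z}/2$, and counts how many stored labels change under a one-symbol substitution.

```python
class ProductTree:
    """Balanced tree of monoid products over a word; leaves hold morphism images."""

    def __init__(self, word, morphism, op, identity):
        self.morphism, self.op = morphism, op
        self.size = 1
        while self.size < len(word):
            self.size *= 2
        self.tree = [identity] * (2 * self.size)
        for i, symbol in enumerate(word):
            self.tree[self.size + i] = morphism(symbol)
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = op(self.tree[2 * node], self.tree[2 * node + 1])

    def value(self):
        return self.tree[1]  # product of the whole word, read in constant time

    def substitute(self, index, symbol):
        """Replace one symbol; return how many stored labels changed."""
        node = self.size + index
        new = self.morphism(symbol)
        changed = 0
        while node >= 1:
            if self.tree[node] != new:
                self.tree[node] = new
                changed += 1
            node //= 2
            if node >= 1:
                new = self.op(self.tree[2 * node], self.tree[2 * node + 1])
        return changed


# Language "even number of a's", recognised by a morphism onto Z/2.
tree = ProductTree("abababab",
                   morphism=lambda s: 1 if s == "a" else 0,
                   op=lambda x, y: (x + y) % 2,
                   identity=0)
print(tree.value())            # 0 means an even number of a's
changed = tree.substitute(1, "a")
print(changed, tree.value())
```

Because the quotient here is a group, the single substitution flips every label on the leaf-to-root path, which is the behaviour the lower bound in the abstract describes for ordered product labels.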

Kops: Safely Extending the eBPF Compilation Pipeline with Native Operations

Wed, 24 Jun 2026 00:00:00 +0000

Proves in Lean 4 that each eBPF native-emit operation computes the same result as its vanilla-bytecode proof sequence. eBPF safely extends OS kernels in domains such as networking, observability, and security. The safety comes from an in-kernel compilation pipeline where a verifier checks every program, and a kernel just-in-time compiler (JIT) translates the verified bytecode to native code. The kernel keeps the JIT simple to stay trustworthy, translating one bytecode instruction at a time in a single pass. This single-pass design misses optimization opportunities, so eBPF runs up to twice as slow as natively compiled code in our characterization. Adding optimizations to the kernel JIT directly requires upstream acceptance and a long release cycle, enlarges the trusted computing base (TCB), and grows the per-architecture kernel code. To address this, we present Kops, an extension interface that lets userspace compilers and kernel modules introduce new operations without modifying the kernel core, while keeping a minimal trusted computing base (TCB). Each operation has two forms, a proof sequence of vanilla eBPF instructions that the existing verifier checks and a native emit of machine instructions that the JIT compiles. Because the verifier checks the proof sequence, the native emit is the only per-operation addition to the TCB. Hardware idioms are the lowest-hanging fruit for this interface. With Kops, we build EInsn, seven operations such as rotate and conditional select that CPUs execute as single instructions. Lean 4 proofs show that each native emit computes the same result as its proof sequence. On x86-64 and ARM64, EInsn speeds up eBPF microbenchmarks by up to 24% and production applications by up to 12%. The same interface also supports whole-program native replacement, reaching 2.358x at the cost of a larger TCB.
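
The "two forms" idea can be illustrated for a 64-bit rotate. In the sketch below, Python integers masked to 64 bits stand in for registers: one function builds the rotate from shift and or steps, the way a multi-instruction sequence would, and another serves as an independent reference. These are illustrative stand-ins, not Kops' actual proof sequences or native emits, and a finite test is far weaker than the Lean proofs the abstract describes.

```python
MASK64 = (1 << 64) - 1


def rotl64_sequence(x: int, k: int) -> int:
    """Rotate left assembled from shift and or steps on a 64-bit value."""
    k &= 63
    left = (x << k) & MASK64
    right = x >> ((64 - k) & 63)  # for k == 0 this shifts by 0, so left | right == x
    return left | right


def rotl64_reference(x: int, k: int) -> int:
    """Reference rotate: move the top k bits of the 64-bit string to the bottom."""
    bits = format(x & MASK64, "064b")
    k %= 64
    return int(bits[k:] + bits[:k], 2)


def select64(cond: int, a: int, b: int) -> int:
    """Branchless conditional select: a if cond is nonzero, else b."""
    mask = -(cond != 0) & MASK64
    return (a & mask) | (b & ~mask & MASK64)


samples = [0, 1, 0x8000000000000001, 0x0123456789ABCDEF, MASK64]
agree = all(rotl64_sequence(x, k) == rotl64_reference(x, k)
            for x in samples for k in range(64))
print(agree, hex(rotl64_sequence(0x8000000000000001, 1)))
```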

Does My Embedding Reflect That $A = B$? Evaluating Mathematical Equivalence in Embedding Models

Wed, 24 Jun 2026 00:00:00 +0000

Evaluates embedding models on mathematical equivalence using Lean formalizations for informal-formal retrieval alignment. Because mathematics is highly abstract, a single statement can take very different forms depending on what subfield it is framed in. There are many examples where breakthroughs occurred after researchers discovered that a question had already been answered in a different field. At the same time, the growth of new resources related to formalization has increased the need for tools that enable efficient and reliable navigation between mathematical 'languages' (e.g., from Lean to natural language). In this paper, we investigate whether current embedding models capture mathematical equivalence. To do this, we introduce the Mathematically Equivalent but Lexically Different Pairs (MELD) Dataset, a collection of mathematically equivalent statements that are expressed in very different language. We show that current state-of-the-art embedding models tend to group statements by the terminology used to make them instead of the underlying math. Motivated by this, we propose a contrastive approach to learning embeddings of mathematical text that focuses on aligning informal statements with different formalizations. Our experiments demonstrate that this leads to improvements not only on informal-formal retrieval tasks but also on MELD, which only contains natural language statements.

Cubic Jordan algebras are not a series

Bruce Westbury — Mon, 22 Jun 2026 00:00:00 +0000

Certifies in Lean 4 the polynomial relations from a computer calculation showing the candidate series are finite point sets. The idea of the exceptional series is that the exceptional simple Lie algebras should form a series. Since all four simple Lie algebras in the fourth row of the Freudenthal magic square are exceptional, it is natural to ask whether the remaining rows form a series. A stronger version of this question is that, for the first two rows (corresponding to the real and complex numbers), there is a category defined by a presentation which is a reasonable candidate for the series. Our main results show that neither of these candidates is a series; each consists of a finite set of points. In each case the series is defined by a parameter, and we show that the relations imply that this parameter satisfies a polynomial. These two results were obtained by a computer calculation. Our calculation is supported by a website for inspection, and the calculations are certified by Lean 4.

GIF: Locally Sound Geometric Information Flow Control for LLMs

Adam Storek, Nikolaus Holzer, Zhuo Zhang, Suman Jana — Mon, 22 Jun 2026 00:00:00 +0000

Provides a fully mechanized Lean 4 proof that the geometric information-flow measure upper-bounds true information flow under local regularity. Large language models increasingly mediate interactions between sensitive data, untrusted inputs, and privileged actions in agentic systems, creating security and privacy risks. These range from prompt injections that manipulate downstream tool use to leakage of confidential information through model outputs. Recent Information Flow Control (IFC)-based defenses show promise but lack a principled semantic foundation for reasoning about information flow through the model itself. Since any input token may influence any output token in an autoregressive LLM, existing approaches suffer from severe taint explosion. We present Geometric Information Flow (GIF), a semantic framework for tracking information flow from input tokens to outputs. GIF uses the LLM Jacobian and local output geometry to upper-bound the Shannon mutual information between perturbed input spans and model outputs, yielding a scalable measure computable on large models via automatic differentiation and low-rank approximation. Unlike attention-based or correlational attribution heuristics, GIF satisfies local geometric soundness, and we provide a fully mechanized Lean 4 proof that it upper-bounds the true information flow induced by a given prompt under local regularity assumptions. We evaluate GIF on integrity and confidentiality tasks across multiple prompt-injection and privacy-leakage benchmarks. GIF achieves near-perfect recall even without a downstream declassifier, outperforming attention-based baselines. Combined with lightweight LLM-based declassifiers, it matches or exceeds the F1 of direct LLM-as-judge baselines such as GPT-5.5 xhigh reasoning while using up to 81x lower token cost. 
GIF flows detected with small surrogate models transfer to larger state-of-the-art models and other model families, even when the surrogate is up to 200x smaller, suggesting black-box deployment without gradient access.

A Greatest Common Divisor Criterion of Certain Binomial Coefficients

Dakai Guo, Ruichen Qiu, Yichuan Cao, Ruyong Feng, Xiao-Shan Gao — Mon, 22 Jun 2026 00:00:00 +0000

Formalizes in Lean the proof of the OEIS A080170 binomial gcd criterion, accepted into the Formal Conjectures project. The binomial greatest common divisor (gcd) criterion recorded as OEIS A080170 is proven. The criterion also appears as conjecture (17) in Ralf Stephan's list of OEIS conjectures. For $k\geq 2$, put \[ D(k)=\gcd_{2\leq q\leq k+1}\binom{qk}{k}, \qquad n=k+1. \] If $P$ is the largest prime-power component $p^a$ exactly dividing $n$, then the criterion asserts \[ D(k)=1 \quad\Longleftrightarrow\quad \frac{n}{P}>P. \] The proof is formalized in Lean and the Lean artifact is accepted as part of the Formal Conjectures project. Both the natural-language proof and the Lean formalization are generated by the MechMath Agent Team, an AI agent developed by the authors.
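
As a small numerical companion to the criterion above, the following Python sketch computes $D(k)$ directly and evaluates the right-hand side $n/P > P$ (equivalently $n > P^2$, since $P$ divides $n$). The function names are illustrative and not taken from the paper or its Lean artifact; the sketch only tabulates both sides for small $k$ and proves nothing.

```python
from functools import reduce
from math import comb, gcd


def binomial_gcd(k: int) -> int:
    """D(k): gcd of C(q*k, k) for q = 2, ..., k + 1."""
    return reduce(gcd, (comb(q * k, k) for q in range(2, k + 2)))


def largest_prime_power_component(n: int) -> int:
    """Largest p**a with p prime and p**a exactly dividing n (n >= 2)."""
    best, m, p = 1, n, 2
    while p * p <= m:
        if m % p == 0:
            power = 1
            while m % p == 0:
                m //= p
                power *= p
            best = max(best, power)
        p += 1
    if m > 1:  # leftover prime factor with exponent 1
        best = max(best, m)
    return best


def criterion_predicts_one(k: int) -> bool:
    """Right-hand side of the criterion: n / P > P, with n = k + 1."""
    n = k + 1
    big_p = largest_prime_power_component(n)
    return n > big_p * big_p


def agreement_table(k_max: int) -> list[tuple[int, int, bool]]:
    """Rows (k, D(k), criterion prediction) for 2 <= k <= k_max."""
    return [(k, binomial_gcd(k), criterion_predicts_one(k))
            for k in range(2, k_max + 1)]
```

Calling `agreement_table(40)` lists, for each $k$, the computed gcd next to the prediction, so the two sides of the equivalence can be compared by eye on small cases.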

ForEx: A Formal Verification Framework for Explainable Reasoning in Logical Fallacy Detection and Annotation

Pei-Cing Huang, Chienyu Liu, Chan Hsu, Ci-Siang Chen, Pei-Ju Lee, Yihuang Kang — Mon, 22 Jun 2026 00:00:00 +0000

Translates LLM-generated fallacy-detection explanations into Lean4 and checks whether the rationale is derivable from encoded premises. Current evaluations of Large Language Models (LLMs) on logical fallacy detection focus on predicted labels, but do not establish whether those labels are supported by the reasoning the models provide. We propose ForEx (Formal Verification for Explainable Reasoning), a framework that translates LLM-generated explanations into Lean4 and verifies whether the translated rationale is derivable under encoded premises, not the logical validity of the original natural language argument. To distinguish prediction outcomes from the formal status of the supporting reasoning, we introduce the LLM Argument Verification Matrix, which separates label consistency from formal verification status. Experiments on LOGIC-Climate show that over 90% of LLM outputs can be translated into formal reasoning chains that pass verification, while agreement with human annotations remains around 20%. These results expose a systematic gap between formal derivability and label agreement, a distinction invisible to prediction-based metrics. ForEx moves LLM evaluation beyond label correctness toward machine-checkable analysis of formalized reasoning chains.

Short Second Proof of the Odd-Modulus Directed Torus Hamilton Decomposition Theorem

SangHyun Park — Mon, 22 Jun 2026 00:00:00 +0000

Formally verifies in Lean 4 a second proof of the odd-modulus directed torus Hamilton decomposition theorem. Let $D_d(m)=\operatorname{Cay}((\mathbb Z/m\mathbb Z)^d,\{e_1,\ldots,e_d\})$, with all generators oriented positively. We give a second proof that $D_d(m)$ decomposes into $d$ directed Hamilton cycles for every $d\ge 2$ and every odd $m\ge 3$. The combinatorial core is a fixed-row-sum selection theorem for replicated supports: when each indexed support $A$ is repeated in $m$ identical rows, one can select $\lfloor |A|/2\rfloor$ entries from each row so that every column total is a unit modulo $m$. Applied to the Hamilton factors using a chosen coordinate direction, these selections prescribe the voltages in a cyclic lift that splits the direction into two. In fibre coordinates, the lifted successor is $\widehat h_j(x,z)=(h_j(x),z+\mathbf 1_{\{j\in M(x)\}})$. After one traversal of the base Hamilton cycle, the fibre return is translation by the total carry. Since this carry is a unit modulo $m$, the return is a single $m$-cycle and the lifted factor is Hamilton. The new fibres also preserve the direction-constant block structure required for the next split. Iterating from a directed $m$-cycle with $d$ parallel copies of each arc yields the desired decomposition. The proof strategy was proposed with the assistance of OpenAI GPT-5.5 Pro and formally verified in Lean 4.
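
The step "the carry is a unit modulo $m$, so the return is a single $m$-cycle" rests on an elementary fact: translation by $c$ on $\mathbb Z/m\mathbb Z$ has orbits of length $m/\gcd(c,m)$. The Python sketch below checks this by brute force; it illustrates only that fact, not the lift construction itself, and the function name is an assumption of this example.

```python
from math import gcd


def orbit_length(carry: int, m: int) -> int:
    """Length of the orbit of 0 under z -> z + carry in Z/mZ."""
    z, steps = carry % m, 1
    while z != 0:
        z = (z + carry) % m
        steps += 1
    return steps


def is_single_cycle(carry: int, m: int) -> bool:
    """True when translation by `carry` permutes Z/mZ as one m-cycle."""
    return orbit_length(carry, m) == m
```

For example, with $m = 9$ the carry $2$ is a unit and gives one 9-cycle, while the carry $3$ is not a unit and splits the fibre into cycles of length 3.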

Quantum Dust from the Curse of Dimensionality

Kenan Oggad — Mon, 22 Jun 2026 00:00:00 +0000

Machine-checks in Lean 4 against Mathlib the measure-concentration collapse, the diffusion-probe value, and the eigenvalue-density criterion. Why do unrelated approaches to quantum gravity nearly all find spacetime two-dimensional at the shortest scales? Each theory answers only within its own dynamics; we highlight a single kinematic route to the same value, one assuming no field equation and living in the geometry of the space of states alone. That route is concentration of measure on the Fubini-Study geometry of pure states, which forces the pairwise distances of a random sample to equalize as the dimension grows, so any finite sample collapses to an equidistant dust whose thresholded metric graph is the complete graph. Handed this dust, a diffusion probe reads it as two-dimensional in the large-sample limit, the value the running spectral dimension takes at the dust's single relaxation scale, a property of the measurement rather than the structure; this convergence on two is not, by itself, evidence that spacetime is two-dimensional. Whether a given two is such an artifact is governed by the Laplacian spectrum near zero, and whether that reading carries across an emergence map is the condition we call spectral faithfulness; a single relaxation scale encodes no spectral dimension that tells one structure from another. The collapse, the probe value, and the eigenvalue-density criterion are machine-checked in Lean 4 against Mathlib, resting on the standard Beta law of overlaps; a power-law tail of small eigenvalues reads a genuine dimension, a single scale above a gap reads two at its own clock, and a gapped two-scale band reads off the universal line. These classes are run on graph-Laplacian proxies, and whether a link-graph reading carries to the physical nonlocal operator is left open. The spectral test reads the eigenvalue density near zero and separates, on a given structure, a measurement artifact from a dimension the structure genuinely expresses.

Four-digit Kaprekar dynamics in odd bases

Evan Chen, Ken Ono, Richard E. Schwartz, Dinesh S. Thakur — Thu, 18 Jun 2026 00:00:00 +0000

Formalizes its odd-base four-digit Kaprekar dynamics theorems in Lean 4 with Mathlib, generated autonomously by the AxiomProver system from statements alone. Start with four digits, arrange them in both descending and ascending order, subtract, and repeat. This simple process is known as the Kaprekar routine, famous in base ten for sending every nonconstant four-digit string to $6174$. We show that in every odd base $B>3$, the four-digit Kaprekar map has an unexpectedly rigid structure. After at most three iterations, every nonconstant orbit enters an explicit triangular region $\mathcal{T}_B$, and on this region the map is conjugate to projective doubling: \[ \{[r],[s]\}\longmapsto \{[2r],[2s]\}. \] This gives a complete finite description of all nonconstant terminal cycles, including an explicit formula for their lengths and counts. In particular, the longest terminal cycle has length at most $(B-1)/2$, and equality can occur only when $B$ is prime. For primes $p>5$, equality occurs precisely when the least positive $m$ with $2^m\equiv\pm1\pmod p$ is $m=(p-1)/2$. The results proved here were first formulated by Schwartz and Thakur. As a test case for AI-assisted formal mathematics, AxiomProver produced Lean/mathlib formalizations of these results.
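
The routine described above is easy to experiment with. The following Python sketch implements one Kaprekar step on a fixed-width digit string in an arbitrary base and extracts the terminal cycle of an orbit. The function names are illustrative, and the code is not derived from the paper's Lean formalization.

```python
from itertools import product


def digits_to_int(digits, base):
    """Interpret a most-significant-first digit sequence in the given base."""
    value = 0
    for d in digits:
        value = value * base + d
    return value


def kaprekar_step(digits, base):
    """Sort descending and ascending, subtract, return fixed-width digits."""
    width = len(digits)
    desc = sorted(digits, reverse=True)
    diff = digits_to_int(desc, base) - digits_to_int(desc[::-1], base)
    out = []
    for _ in range(width):
        diff, r = divmod(diff, base)
        out.append(r)
    return tuple(reversed(out))


def terminal_cycle(start, base):
    """Iterate until a state repeats; return the cycle the orbit ends in."""
    seen, orbit, cur = {}, [], tuple(start)
    while cur not in seen:
        seen[cur] = len(orbit)
        orbit.append(cur)
        cur = kaprekar_step(cur, base)
    return orbit[seen[cur]:]


def longest_terminal_cycle(base):
    """Longest terminal cycle over all nonconstant four-digit strings."""
    best = 0
    for ds in product(range(base), repeat=4):
        if len(set(ds)) > 1:
            best = max(best, len(terminal_cycle(ds, base)))
    return best
```

In base ten, `terminal_cycle((3, 5, 2, 4), 10)` ends in the fixed point `(6, 1, 7, 4)`. For small odd bases $B > 3$, `longest_terminal_cycle(B)` can be compared against the stated bound $(B-1)/2$.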

Formalizing Extended Complex Numbers, Mobius Transformations, and Cross Ratio in Lean 4

Fubin Yan, Kenneth W. Shum — Thu, 18 Jun 2026 00:00:00 +0000

Formalizes the extended complex plane, Mobius transformations, and the cross ratio in Lean 4 using Mathlib. The extended complex plane is a fundamental object in complex analysis, hyperbolic geometry, and mathematical physics. Its geometry is governed by Möbius transformations, with the cross ratio serving as a central invariant. We present a formalization of these concepts in the Lean4 theorem prover. The extended complex plane is represented using Mathlib's Option type over $\mathbb{C}$, where the additional element represents the point at infinity. On this foundation, we define Möbius transformations, their action on the extended complex plane, and the cross ratio. We formalize several basic properties of Möbius transformations, including their group structure, and identify them with a projective general linear group. We also prove the uniqueness of a Möbius transformation mapping any three distinct points to any other three distinct points, and the invariance of the cross ratio. All proofs are machine-checked in Lean 4. The complete development comprises approximately 6,000 lines of Lean code, including about 40 definitions and 150 lemmas and theorems. This work provides a verified foundation for future formalizations of conformal geometry, hyperbolic models, modular forms, and applications in mathematical physics.
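
To make the encoding concrete, here is a numerical Python analogue of the `Option`-based representation, with `None` playing the role of the point at infinity. It is a floating-point sketch, not the Lean development: the function names and the cross-ratio convention $(z_1,z_2;z_3,z_4) = \frac{(z_1-z_3)(z_2-z_4)}{(z_1-z_4)(z_2-z_3)}$ are assumptions of this example, and the paper may use a different convention.

```python
from typing import Optional

ExtComplex = Optional[complex]  # None stands for the point at infinity


def mobius(a: complex, b: complex, c: complex, d: complex,
           z: ExtComplex) -> ExtComplex:
    """Apply z -> (a z + b) / (c z + d) on the extended complex plane."""
    if a * d - b * c == 0:
        raise ValueError("degenerate coefficients: ad - bc must be nonzero")
    if z is None:  # image of infinity
        return None if c == 0 else a / c
    denom = c * z + d
    if denom == 0:  # the pole is sent to infinity
        return None
    return (a * z + b) / denom


def cross_ratio(z1: complex, z2: complex, z3: complex, z4: complex) -> complex:
    """Cross ratio of four distinct finite points (assumed convention)."""
    return ((z1 - z3) * (z2 - z4)) / ((z1 - z4) * (z2 - z3))
```

Applying `mobius(2, 1, 1, -1, z)` to four distinct finite points that avoid the pole $z = 1$ leaves `cross_ratio` unchanged up to rounding error, which is the invariance the formalization proves exactly.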

BARReL: a modern backend for Atelier B in Lean

Ghilain Bergeron, Vincent Trélat — Thu, 18 Jun 2026 00:00:00 +0000

Implements BARReL, a Lean 4 library embedding Atelier B's B method with well-definedness conditions enforced through dependent types. BARReL is a Lean 4 library bridging Atelier B, an industrial tool for the B method, and the Lean proof assistant by enabling users to conduct their formal B developments -- up to machine refinement and implementation -- interactively inside Lean, while retaining standard B syntax. B partial operators are carefully encoded by generating explicit well-definedness conditions, leveraging Lean's dependent types to enforce a well-definedness discipline by construction. That is, proof obligations and proof steps cannot silently rely on ill-typed or ill-defined instantiations. BARReL also features basic automation to try to discharge such well-definedness conditions automatically. The implementation is written entirely using Lean meta-programming and is designed to be modular: extending the supported B fragment typically requires only adding new syntax and encoding clauses. We illustrate the approach on a small but representative case study, and argue that BARReL can act as a stepping stone towards a strongly reliable Atelier B toolchain grounded in the Lean proof assistant.

Process-Verified Reinforcement Learning for Theorem Proving via Lean

Minsu Kim, Se-Young Yun — Thu, 18 Jun 2026 00:00:00 +0000

Uses Lean's elaboration as a process oracle, parsing proofs into tactics to supply dense tactic-level RL rewards on MiniF2F and ProofNet. While reinforcement learning from verifiable rewards (RLVR) has typically relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured rewards into a GRPO-style reinforcement learning objective that uses first-error propagation and first-token credit methods to balance outcome- and process-level advantages. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.
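
A rough way to picture tactic-level credit is sketched below in Python. The abstract does not give the reward values or the exact propagation rule, so everything here is an assumption made for illustration: steps that elaborate before the first failure receive a positive reward, the earliest failing step receives a penalty, and later steps receive zero because they were never soundly reached.

```python
def tactic_rewards(step_ok, ok_reward=1.0, fail_penalty=-1.0):
    """Per-tactic rewards from per-tactic verdicts (hypothetical scheme).

    step_ok[i] is True when tactic i elaborated successfully. Rewards stop
    at the first failing tactic; everything after it gets 0.0.
    """
    rewards, failed = [], False
    for ok in step_ok:
        if failed:
            rewards.append(0.0)
        elif ok:
            rewards.append(ok_reward)
        else:
            rewards.append(fail_penalty)
            failed = True
    return rewards


def outcome_reward(step_ok):
    """Binary outcome signal: 1.0 only if every tactic succeeded."""
    return 1.0 if all(step_ok) else 0.0
```

A proof attempt with verdicts `[True, True, False, True]` therefore yields the dense signal `[1.0, 1.0, -1.0, 0.0]` alongside an outcome reward of `0.0`, whereas an outcome-only scheme would see just the final zero.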

Prismriver: Formalization of Music Theory and Algorithmic Composition in Lean 4

Leni Aniva, Claire Wang — Thu, 18 Jun 2026 00:00:00 +0000

Presents Prismriver, a Lean 4 library formalizing music theory with Mathlib group actions and supporting verifiable algorithmic composition. Music theory obeys a rich set of mathematical rules and symmetries. These symmetries follow a mathematical structure that can be expressed and verified in the precise language of a proof assistant. In this paper, we present Prismriver, a formalization of music theory in Lean 4. By formalizing music theory in Lean 4, we open the door to verifiable algorithmic composition and accompaniment generation. We also enable the monadic analysis of structures in music.

Finishing Oltean's Completeness Proof in Lean 4 for Hybrid Logic $L(\forall)$

Lars Warren Ericson — Thu, 18 Jun 2026 00:00:00 +0000

Completes a machine-checked completeness theorem for hybrid logic L(forall) in Lean 4, building on Oltean's formalization. We present a machine-checked completeness theorem, in Lean 4, for the hybrid logic $L(\forall)$: propositional modal logic with nominals, the satisfaction-style binder $\forall$, and the box modality. (Machine-checked completeness for basic hybrid logic, without binders, was pioneered by Asta Halkjær From in Isabelle/HOL.) We build on Alex Oltean's 2023 Lean 4 formalization, which mechanized the syntax, semantics, Hilbert-style proof system, and soundness following Blackburn's Hybrid Completeness (1998), but left completeness unfinished. Finishing it requires manufacturing fresh names at two structurally different points, and our central finding is that they call for two different tools. (1) The root witnessed maximal consistent set, built by an extended Lindenbaum construction, needs at each step a nominal fresh for the whole set; the right tool is structural freshness: extend the language so an infinite supply of nominals is reserved by construction. We survey the design space (Oltean's odd/even encoding inside $\mathbb{N}$, the disjoint-sum $N \oplus \mathbb{N}$ parameterization suggested by Bud Mishra, and From's synthetic-completeness frameworks) and explain the encoding we adopt. (2) The witnessed $\Diamond$-successor of a maximal consistent set cannot be obtained this way: its canonical box-reduct provably mentions every nominal, so no reserved name is fresh. Here the right tool is one Oltean chose but left incomplete: an existence-lemma Henkin construction drawing each witness from the predecessor's witnessedness through a fresh state variable; we complete it with a data-carrying witness accumulator and a compactness argument. The theorem $Γ\models \varphi \to Γ\vdash \varphi$ is fully formalized: the development is sorry-free, and #print axioms reports only propext, Classical.choice, and Quot.sound. 
We port the development to Lean v4.30.0 / mathlib v4.30.0.

TheoremBench: Evaluating LLMs on Theorem Proving in Formal Mathematics

QuocViet Pham, Elvir Karimov, Andrey Galichin, Ivan Oseledets — Thu, 18 Jun 2026 00:00:00 +0000

Introduces TheoremBench, a Lean4 benchmark of ~100 classical theorems with extracted premises for evaluating LLM provers. LLMs have recently achieved strong results on formal proving benchmarks. However, existing evaluations remain heavily concentrated on competition-style problems and often fail to capture how models behave on longer, more dependency-rich mathematical developments. We introduce TheoremBench, a Lean4 benchmark designed to evaluate theorem provers beyond contest settings. The benchmark is built from nearly one hundred classical theorems and is released in two complementary forms: a plain main version containing one target theorem per instance, and a premised version that expands each theorem into a structured family of related proving tasks consisting of the main theorem together with automatically extracted supporting subtheorems. This design enables evaluation of not only whether the final theorem was proved from scratch, but also of partial progress through the internal proof structure of a theorem. Our experiments show that explicit premises substantially improve performance for Lean4-capable prover models. To provide a comprehensive evaluation, we introduce theorem-level coverage and token-efficiency metrics that expose qualitative differences in proof behavior. The results show that current provers remain strongly biased toward easy subtheorems and often solve theorems through long and inefficient tactic traces rather than compact proof plans. TheoremBench therefore provides a more fine-grained view of formal reasoning ability and highlights the importance of structural benchmark design for evaluating Lean4 theorem provers.

Lean4Agent: Formal Modeling and Verification for Agent Workflow and Trajectory

Ruida Wang, Jerry Huang, Pengcheng Wang, Xuanqing Liu, Luyang Kong, Tong Zhang — Thu, 18 Jun 2026 00:00:00 +0000

Uses Lean4 and the FormalAgentLib library to formally model and verify LLM-agent workflows and execution trajectories. Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflows and execution trajectories. This challenge mirrors a long-standing problem in mathematics, where the ambiguity of natural languages (NLs) motivates the development of formal languages (FLs). Inspired by this paradigm, we propose **Lean4Agent**, to the best of our knowledge the first framework that uses Lean4, a dependent-type FL, to model and verify agent behavior. **Lean4Agent** launches **FormalAgentLib**, an extensible Lean4 library for formally modeling agent workflows, verifying their semantic consistency under explicit assumptions, and localizing execution-time failures revealed by trajectories. Building on **FormalAgentLib**, we further develop **LeanEvolve**, which applies results in **FormalAgentLib** to revise workflows and enhance their capability. Extensive experiments on a hard problem subset of SWE-Bench-Verified and a subset of ELAIP-Bench across 5 leading LLMs indicate that the verification-passing workflows outperform the failing ones by an average of **11.94%**, and **LeanEvolve** further improves SWE performance by **7.47%** on average. Furthermore, **Lean4Agent** establishes a foundation for a new field of using expressive dependent-type FLs to formally model and verify agent behavior.

Formalizing multi-graded Brenner-Schröer Proj schemes and dilatations of rings in Lean4

Arnaud Mayeux, Jujian Zhang — Thu, 18 Jun 2026 00:00:00 +0000

Formalizes multigraded Brenner-Schroeer Proj schemes and algebraic dilatations of rings in Lean4. We present a detailed formalization in Lean4 of some multigraded algebraic geometry constructions, focusing on the Brenner--Schröer Proj construction and algebraic dilatations of rings.

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

Nishal Thomas, Noel Thomas — Thu, 18 Jun 2026 00:00:00 +0000

Builds the FormInv invariance benchmark on 103 Lean4-verified Mathlib4 theorems to measure LLM consistency across paraphrases. A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap: Claude Haiku 4.5 achieves 86% accuracy yet SCR=50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86-96% yet Semantic Consistency Rates (SCR) span 50-82% -- a 32-point gap invisible to standard benchmarks. Formally, for any target ranking over 9 frontier models there exists a weighting over paraphrase families that realizes it (No-Free-Benchmark corollary), because no model Pareto-dominates all families -- so benchmark designers who select families are implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.

Modularity, Extensions and Connectivity in Infinite Matroids

Mattias Ehatamm, Peter Nelson, Fernanda Rivera Omana — Thu, 18 Jun 2026 00:00:00 +0000

Provides a Lean4-formalized repository of all main results on modularity, extensions, and connectivity in infinite matroids. We generalize the well-studied notion of a modular pair of a finite matroid to arbitrary families of sets in infinite matroids, and use it to develop the theory of infinite matroids in several as-yet-unexplored areas. Our results include a complete theory of single-element extensions, a description of the relationship between quotients and projections, a proof that matroids for which every flat is modular must be finitary, and two new perspectives on the infinite matroid connectivity parameter λ. In most cases, existing theory for finite matroids either fails completely or does not extend in obvious ways, and as a result we develop multiple new techniques for reasoning about infinite matroids, including establishing well-behaved infinite analogues of nullity, local connectivity and skewness. We also point to an online repository containing formalized proofs of all our results using the Lean4 proof assistant.

Scaling Self-Play with Self-Guidance

Luke Bailey, Kaiyue Wen, Kefan Dong, Tatsunori Hashimoto, Tengyu Ma — Thu, 18 Jun 2026 00:00:00 +0000

Applies the Self-Guided Self-Play algorithm to Lean4 formal theorem proving, with a Guide model preventing conjecturer collapse. LLM self-play algorithms are notable in that, in principle, nothing bounds their learning: a Conjecturer model creates problems for a Solver, and both improve together. However, in practice, existing LLM self-play methods do not scale well with large amounts of compute, instead hitting learning plateaus. We argue this is because over long training runs, the Conjecturer learns to hack its reward, collapsing to artificially complex problems that do not help the Solver improve. To overcome this, we introduce Self-Guided Self-Play (SGS), a self-play algorithm in which the language model itself guides the Conjecturer away from degeneracy. In SGS, the model takes on three roles: Solver, Conjecturer, and a Guide that scores synthetic problems by their relevance to unsolved target problems and how clean and natural they are, providing supervision against Conjecturer collapse. Our core hypothesis is that language models can assess whether a subproblem is useful for achieving a goal. We evaluate the scaling properties of SGS by running training for significantly longer than prior works and by fitting scaling laws to cumulative solve rate curves. Applying SGS to formal theorem proving in Lean4, we find that it surpasses the asymptotic solve rate of our strongest RL baseline in fewer than 80 rounds of self-play and enables a 7B parameter model, after 200 rounds of self-play, to solve more problems than a 671B parameter model pass@4.

Improving Lean4 Autoformalization via Cycle Consistency Fine-tuning

Arsen Shebzukhov — Thu, 18 Jun 2026 00:00:00 +0000

Fine-tunes a 2B model for NL-to-Lean4 autoformalization with a GRPO cycle-consistency reward. Autoformalization - automatically translating natural language mathematical texts into a formal proof language such as Lean4 - can help accelerate AI-assisted mathematical research, be it via proof verification or proof search. I fine-tune Qwen3.5-2B with LoRA for natural language to Lean4 formalization on FineLeanCorpus and consider three training regimes: supervised fine-tuning (SFT) with curriculum learning (difficulty 1 to 10), SFT without curriculum ordering, and reinforcement learning using group relative policy optimization (GRPO) with a cycle consistency reward. Cycle consistency measures how well the meaning of a statement is preserved through an NL to Lean4 to NL' loop, computed as cosine similarity of off-the-shelf sentence embeddings. On an unseen subset of FineLeanCorpus (FLC) and on PutnamBench, RL substantially outperforms both SFT variants (mean cycle consistency 0.669 vs. 0.513 on FLC; 0.561 vs. 0.422 on PutnamBench), while increasing cross-entropy loss by only 0.011 nats, with minimal impact on formalization quality. Curriculum ordering provides no measurable benefit over shuffled training.
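
The cycle-consistency score described above can be sketched in a few lines of Python. The three callables `formalize`, `informalize`, and `embed` are placeholders assumed for this example (they stand in for the NL-to-Lean4 model, a Lean4-to-NL back-translator, and a sentence-embedding model); none of them is an API from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def cycle_consistency(statement, formalize, informalize, embed):
    """Score an NL -> Lean4 -> NL' round trip by embedding similarity.

    formalize, informalize and embed are caller-supplied placeholders.
    """
    lean_code = formalize(statement)
    round_trip = informalize(lean_code)
    return cosine_similarity(embed(statement), embed(round_trip))
```

With real models plugged in, a score near 1 suggests the round trip preserved the statement's meaning as far as the embedding model can tell; the measure inherits whatever blind spots that embedding model has.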

LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Thu, 18 Jun 2026 00:00:00 +0000

LongCat-Flash-Prover, a 560B-parameter MoE, performs native formal reasoning in Lean4 via agentic tool-integrated RL. We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent train-inference engine discrepancies at both sequence and token levels. We also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking issues. Extensive evaluations show that our LongCat-Flash-Prover sets a new state-of-the-art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test using an inference budget of only 72 per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.

A Symplectic Proof of the Quantum Singleton Bound

Frederick Dehmel, Shilun Li — Thu, 18 Jun 2026 00:00:00 +0000

Gives a symplectic linear-algebraic proof of the Quantum Singleton Bound with a Lean4 formalization of the argument. We present a symplectic linear-algebraic proof of the Quantum Singleton Bound for stabiliser quantum error-correcting codes together with a Lean4 formalisation of the linear-algebraic argument. The proof is formulated in the language of finite-dimensional symplectic vector spaces modelling Pauli operators and relies on distance-based erasure correctability and the cleaning lemma. Using a dimension-counting argument within the symplectic stabiliser framework, we derive the bound $k + 2(d-1) \le n$ for any $[[n, k, d]]$ stabiliser code. This approach isolates the algebraic structure underlying the bound and avoids the heavier analytic machinery that appears in entropy-based proofs, while remaining well-suited to formal verification.
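
As a quick arithmetic illustration of the bound $k + 2(d-1) \le n$, the Python sketch below checks it on the parameters of a few well-known stabiliser codes. The helper names are illustrative; the check is plain arithmetic on $[[n, k, d]]$ and says nothing about whether a code with given parameters exists.

```python
def satisfies_quantum_singleton(n: int, k: int, d: int) -> bool:
    """True when the parameters [[n, k, d]] obey k + 2(d - 1) <= n."""
    return k + 2 * (d - 1) <= n


def singleton_slack(n: int, k: int, d: int) -> int:
    """n - (k + 2(d - 1)); zero means the bound is met with equality."""
    return n - (k + 2 * (d - 1))


# Parameters of standard examples: five-qubit code, Steane code, [[4,2,2]].
KNOWN_CODES = {"five-qubit": (5, 1, 3), "Steane": (7, 1, 3), "[[4,2,2]]": (4, 2, 2)}
```

The five-qubit code and the $[[4,2,2]]$ code have slack 0 and so saturate the bound, the Steane code has slack 2, and a hypothetical $[[4,1,3]]$ code is ruled out because $1 + 2\cdot 2 = 5 > 4$.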

The Axiom of Consent: Friction Dynamics in Multi-Agent Coordination

Murad Farzulla — Thu, 18 Jun 2026 00:00:00 +0000

Provides machine-checked Lean 4 proofs of the core comparative-statics of a consent-based coordination-friction framework. Multi-agent systems must coordinate despite heterogeneous preferences, asymmetric stakes, and imperfect information. When coordination fails, friction emerges: measurable resistance such as deadlock, thrashing, or conflict. We derive a formal framework for coordination friction from a single axiom: actions affecting agents require their authorization in proportion to stakes. From this axiom of consent we establish the kernel triple $(α, σ, \varepsilon)$ -- alignment, stake, and entropy -- as sufficient statistics for any resource-allocation configuration, and propose a friction functional whose simplest candidate form $F = σ(1+\varepsilon)/(1+α)$ predicts that friction rises with stakes and entropy and falls with alignment. We stress that this form is a phenomenological ansatz, not a theorem -- the simplest expression satisfying our desiderata -- whose empirical adequacy, in particular whether the alignment dependence is monotone, remains open. A companion study tests it in a multi-agent reinforcement-learning environment, finds the linear alignment dependence falsified by a U-shaped relationship, and motivates a quadratic form $F = σ(1+\varepsilon)/(1+α^2)$ that we characterize axiomatically as a refinement for future confirmation. The Replicator-Optimization Mechanism governs selection over coordination strategies: lower-friction configurations persist longer, making consent-respecting arrangements dynamical attractors rather than normative ideals. We give formal definitions for resource consent, coordination legitimacy, and friction-aware allocation, a measurement apparatus, and machine-checked Lean 4 proofs of the core comparative-statics. Illustrative applications to cryptocurrency governance and political legitimacy show one architecture spanning domains, offered as candidate unification, not established identity.
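
The two candidate forms of the friction functional are simple enough to evaluate directly. The Python sketch below encodes $F = σ(1+\varepsilon)/(1+α)$ and the quadratic variant $F = σ(1+\varepsilon)/(1+α^2)$. The abstract does not state the admissible ranges of the parameters, so the guard $α > -1$ in the linear form is an assumption made here to keep the denominator positive, and the function names are illustrative.

```python
def friction_linear(alpha: float, sigma: float, epsilon: float) -> float:
    """F = sigma * (1 + epsilon) / (1 + alpha); assumes alpha > -1."""
    if alpha <= -1:
        raise ValueError("linear form needs alpha > -1 (assumed domain)")
    return sigma * (1 + epsilon) / (1 + alpha)


def friction_quadratic(alpha: float, sigma: float, epsilon: float) -> float:
    """F = sigma * (1 + epsilon) / (1 + alpha**2); defined for every alpha."""
    return sigma * (1 + epsilon) / (1 + alpha ** 2)
```

For positive $σ$, both forms rise with $σ$ and $\varepsilon$. The linear form falls monotonically in $α$ on its assumed domain, while the quadratic form is symmetric in $α$ and peaks at $α = 0$, so the two forms disagree about how alignment affects friction once negative alignment is allowed.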

The Replicator-Optimization Mechanism: A Scale-Relative Formalism for Persistence-Conditioned Dynamics with Application to Consent-Based Metaethics

Murad Farzulla — Thu, 18 Jun 2026 00:00:00 +0000

Includes machine-checked Lean proofs of core algebraic results for a scale-relative replicator-optimization formalism. This paper formalizes a widely used dynamical class--replicator-mutator dynamics and Price-style selection-and-transmission--and makes explicit the modeling choices (scale, atomic unit, interaction topology, transmission kernel) that determine how this class instantiates across domains. The backbone is known; we do not claim to have discovered selection. The novel contributions are threefold: (i) a scale-relative kernel parameterization where atomic units are themselves parameters, enabling systematic instantiation across physics, biology, economics, cognition, and social organization; (ii) a consent-friction instantiation for political philosophy, where friction is the primitive, legitimacy functions as survival probability, and belief-transfer functions as mutation kernel; and (iii) a derivation path from social contract theory rather than from biology or physics, arriving at the same formal structure via an independent route. We provide a bridge principle connecting descriptive dynamics to instrumental normativity: if agents prefer lower expected friction, then "ought" claims are shorthand for policies that reduce expected friction under the specified dynamics. This conditional structure avoids the is-ought fallacy while grounding normative discourse in empirically tractable dynamics. We address pathological cases (authoritarian stability, suppressed friction) through explicit modeling of latent versus observed friction. The framework generates testable predictions through operationalization of friction, legitimacy, and belief-transfer dynamics, and is falsifiable at the level of measurement apparatus rather than formal structure.

Lean4Physics: Comprehensive Reasoning Framework for College-level Physics in Lean4

Thu, 18 Jun 2026 00:00:00 +0000

Introduces Lean4PHYS with PhysLib and the LeanPhysBench benchmark of 200 Lean4 college-physics statements. We present **Lean4PHYS**, a comprehensive reasoning framework for college-level physics problems in Lean4. **Lean4PHYS** includes *LeanPhysBench*, a college-level benchmark for formal physics reasoning in Lean4, which contains 200 hand-crafted and peer-reviewed statements derived from university textbooks and physics competition problems. To establish a solid foundation for formal reasoning in physics, we also introduce *PhysLib*, a community-driven repository containing fundamental unit systems and theorems essential for formal physics reasoning. Based on the benchmark and Lean4 repository we composed in **Lean4PHYS**, we report baseline results using major expert Math Lean4 provers and state-of-the-art closed-source models, with the best performance of DeepSeek-Prover-V2-7B achieving only 16% and Claude-Sonnet-4 achieving 35%. We also conduct a detailed analysis showing that our *PhysLib* can achieve an average improvement of 11.75% in model performance. This demonstrates the challenging nature of our *LeanPhysBench* and the effectiveness of *PhysLib*. To the best of our knowledge, this is the first study to provide a physics benchmark in Lean4.

Formalization of Auslander--Buchsbaum--Serre criterion in Lean4

Naillin Guan, Yongle Hu — Thu, 18 Jun 2026 00:00:00 +0000

Formalizes the Auslander-Buchsbaum-Serre criterion characterizing regular local rings in Lean4, with supporting homological algebra. We present a comprehensive formalization in the Lean4 theorem prover of the Auslander--Buchsbaum--Serre criterion, which characterizes regular local rings as those Noetherian local rings with finite global dimension. Rather than following the well-known proof that computes the projective dimension of the residue field via quotient by regular sequences and uses the Koszul complex to bound the cotangent space dimension by the global dimension, our approach is built systematically on the formalization of depth defined via the vanishing of Ext functors. We establish key homological results including Rees' theorem, the Auslander--Buchsbaum formula, and Ischebeck's theorem, and further develop the theories of Cohen--Macaulay modules and rings, including a complete formalization of the unmixedness theorem for Cohen--Macaulay rings. To prove the Auslander--Buchsbaum--Serre criterion, we show that maximal Cohen--Macaulay modules over regular local rings are free and establish a weakened form of the Ferrand--Vasconcelos theorem specific for the unique maximal ideal. As corollaries, we deduce that regularity can be checked at maximal ideals and formalize Hilbert's Syzygy Theorem. This work demonstrates how homological algebra can be effectively employed in the formalization of commutative algebra, providing extensive infrastructure for future developments in the field.
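
For orientation, two of the statements named above can be written in standard textbook notation. The notation is assumed here and is not taken from the Lean development. Let $(R, \mathfrak{m}, k)$ be a Noetherian local ring and $M$ a nonzero finitely generated $R$-module of finite projective dimension.

```latex
\[
\operatorname{pd}_R M + \operatorname{depth} M = \operatorname{depth} R
\qquad \text{(Auslander--Buchsbaum formula)}
\]
\[
R \text{ is regular} \iff \operatorname{gl.dim} R < \infty
\qquad \text{(Auslander--Buchsbaum--Serre criterion)}
\]
```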

Dual-Regularized Riccati Recursions for Interior-Point Optimal Control

João Sousa-Pinto, Dominique Orban — Thu, 18 Jun 2026 00:00:00 +0000

Provides a full Lean formalization of dual-regularized Riccati recursions and their inertia/descent results for interior-point optimal control. We derive closed-form extensions of the sequential and parallel Riccati recursions for solving dual-regularized linear-quadratic regulator (LQR) problems, with $O(N)$ sequential time and $O(\log(N))$ parallel time, respectively. We show that these subproblems arise when using regularized primal-dual interior-point methods to solve smooth, constrained, non-convex, discrete-time optimal control problems via multiple-shooting, even in the presence of stagewise equality or inequality constraints, and without imposing any rank requirements on constraint Jacobians. We prove that, when certain inertia conditions on the Newton-KKT matrix are met, each nonzero primal step is a descent direction of an augmented barrier-Lagrangian merit function. We characterize these inertia conditions in terms of the positive-definiteness of the dual-regularized Riccati pivots (a weaker condition than the standard LQR positive-definiteness requirements), thereby yielding inexpensive certificates of the required inertia. We provide MIT-licensed implementations of our methods in C++ and in JAX, as well as a full formalization of our results in Lean. We benchmark our algorithm against leading optimal control and nonlinear programming solvers on complex trajectory optimization problems, establishing competitive performance on moderate problems and substantial gains as the horizon length, problem dimension, and constraint count increase.
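
As background for the $O(N)$ sequential claim, here is a minimal sketch of the classical, unregularized backward Riccati recursion for a scalar LQR problem. It is not the dual-regularized recursion derived in the paper. The function name, the scalar setting, and the variable `pivot` (our label for $r + b^2 p_{k+1}$) are all assumptions made for illustration.

```python
def scalar_riccati(a, b, q, r, q_final, horizon):
    """Classical backward Riccati recursion for the scalar system
    x[k+1] = a*x[k] + b*u[k] with cost sum(q*x^2 + r*u^2) + q_final*x[N]^2.

    Returns (p, gains), where p[k]*x^2 is the cost-to-go at stage k and
    the optimal control is u[k] = -gains[k] * x[k].
    """
    p = [0.0] * (horizon + 1)
    gains = [0.0] * horizon
    p[horizon] = q_final
    # One pass from the last stage to the first: O(horizon) work.
    for k in range(horizon - 1, -1, -1):
        pivot = r + b * b * p[k + 1]  # hypothetical name; must be positive
        if pivot <= 0.0:
            raise ValueError("non-positive pivot: stage problem is not strictly convex")
        gains[k] = a * b * p[k + 1] / pivot
        p[k] = q + a * a * p[k + 1] - a * b * p[k + 1] * gains[k]
    return p, gains


p, gains = scalar_riccati(a=1.0, b=1.0, q=1.0, r=1.0, q_final=1.0, horizon=50)
print(p[0], gains[0])
```

With these unit parameters the stationary value solves $p^2 - p - 1 = 0$, so for a long horizon `p[0]` approaches the golden ratio.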

A Secure Sequencer and Data Availability Committee for Rollups (Extended Version)

Thu, 18 Jun 2026 00:00:00 +0000

Mechanizes fraud-proof games for rollup sequencers and data-availability committees in LEAN4, including verified honest strategies. Blockchains face a scalability limitation, partly due to the throughput limitations of consensus protocols, especially when aiming to obtain a high degree of decentralization. Layer 2 Rollups (L2s) are a faster alternative to conventional blockchains. L2s perform most computations offchain, using blockchains (L1) minimally under the hood to guarantee correctness. A sequencer is a service that receives offchain L2 transaction requests, batches these transactions, and commits compressed or hashed batches to L1. Hashing needs less L1 space, which is beneficial for gas cost, but requires a data availability committee (DAC) service to translate hashes into their corresponding batches of transaction requests. The behavior of sequencers and DACs influences the evolution of the L2 blockchain, presenting a potential security threat and delaying L2 adoption. We propose in this paper fraud-proof mechanisms, arbitrated by L1 contracts, to detect and generate evidence of dishonest behavior of the sequencer and DAC. We study how these fraud-proofs limit the power of adversaries that control different numbers of sequencer and DAC members, and provide incentives for their honest behavior. We designed these fraud-proof mechanisms as two-player games. Unlike the generic fraud-proofs in current L2s (designed to guarantee the correct execution of transactions), our fraud-proofs are over predetermined algorithms that verify the properties that determine the correctness of the DAC. Arbitrating over concrete algorithms makes our fraud-proofs more efficient, easier to understand, and simpler to prove correct. We provide as an artifact a mechanization in LEAN4 of our fraud-proof games, including (1) the verified strategies that honest players should play to win all games as well as (2) mechanisms to detect dishonest claims.

Geoint-R1: Formalizing Multimodal Geometric Reasoning with Dynamic Auxiliary Constructions

Thu, 18 Jun 2026 00:00:00 +0000

Geoint-R1 generates Lean4 code for auxiliary geometric constructions, with a 1,885-problem benchmark of Lean4-annotated geometry problems. Mathematical geometric reasoning is essential for scientific discovery and educational development, requiring precise logic and rigorous formal verification. While recent advances in Multimodal Large Language Models (MLLMs) have improved reasoning tasks, existing models typically struggle with formal geometric reasoning, particularly when dynamically constructing and verifying auxiliary geometric elements. To address these challenges, we introduce Geoint-R1, a multimodal reasoning framework designed to generate formally verifiable geometric solutions from textual descriptions and visual diagrams. Geoint-R1 uniquely integrates auxiliary element construction, formal reasoning represented via Lean4, and interactive visualization. To systematically evaluate and advance formal geometric reasoning, we propose the Geoint benchmark, comprising 1,885 rigorously annotated geometry problems across diverse topics such as plane, spatial, and solid geometry. Each problem includes structured textual annotations, precise Lean4 code for auxiliary constructions, and detailed solution steps verified by experts. Extensive experiments demonstrate that Geoint-R1 significantly surpasses existing multimodal and math-specific reasoning models, particularly on challenging problems requiring explicit auxiliary element constructions.

Understanding Haskell-style Overloading via Open Data and Open Functions

Andrew Marmaduke, Apoorv Ingle, J. Garrett Morris — Thu, 18 Jun 2026 00:00:00 +0000

Mechanizes the metatheory of System F_D, a core language for Haskell-style overloading, in the Lean4 theorem prover. We present a new, uniform semantics for Haskell-style overloading. We realize our approach in a new core language, System F$_\mathrm{D}$, whose metatheory we mechanize in the Lean4 interactive theorem prover. System F$_\mathrm{D}$ is distinguished by its open data types and open functions, each given by a collection of instances rather than by a single definition. We show that System F$_\mathrm{D}$ can encode advanced features of Haskell's type class system, more expressively than current semantics of these features, and without assuming additional type equality axioms.

FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

Thu, 18 Jun 2026 00:00:00 +0000

Presents FormalMATH, a Lean4 benchmark of 5,560 formally verified problems built with a human-in-the-loop autoformalization pipeline. Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only a 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in formal reasoning settings. We believe that FormalMATH provides a robust benchmark for evaluating formal mathematical reasoning.
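
To make negation-based disproof filtering concrete: if a prover can establish the negation of an autoformalized statement, the statement is false as written and can be discarded before manual review. The Lean 4 snippet below is a toy illustration with a made-up faulty statement. It is not drawn from the benchmark, and the theorem name is hypothetical.

```lean
-- Hypothetical faulty autoformalization: "every natural number equals its successor".
-- Proving the negation shows the candidate statement is false, so a
-- disproof filter would drop it.
theorem faulty_statement_refuted : ¬ ∀ n : Nat, n + 1 = n :=
  fun h => absurd (h 0) (by decide)
```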

Formalization of Optimality Conditions for Smooth Constrained Optimization Problems

Chenyi Li, Shengyang Xu, Chumin Sun, Li Zhou, Zaiwen Wen — Thu, 18 Jun 2026 00:00:00 +0000

Formalizes the KKT first-order optimality conditions for smooth constrained optimization in Lean4, including the Farkas lemma and weak duality. Optimality conditions are central to the analysis of optimization problems, characterizing necessary criteria for local minima. Formalizing the optimality conditions within the type-theory-based proof assistant Lean4 provides a precise, robust, and reusable framework essential for rigorous verification in optimization theory. In this paper, we introduce a formalization of the first-order optimality conditions (also known as the Karush-Kuhn-Tucker (KKT) conditions) for smooth constrained optimization problems, beginning with concepts such as the Lagrangian function and constraint qualifications. The geometric optimality conditions are then formalized, offering insights into local minima through tangent cones. We also establish the critical equivalence between the tangent cone and linearized feasible directions under appropriate constraint qualifications. Building on these key elements, the formalization derives the KKT conditions via a proof of the Farkas lemma. Additionally, this study provides a formalization of the dual problem and the weak duality property.
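
For reference, the KKT conditions can be stated as follows in one common textbook convention. The problem form, index sets $\mathcal{E}$ and $\mathcal{I}$, and sign convention are assumptions here and may differ from those used in the formalization. Consider $\min_x f(x)$ subject to $c_i(x) = 0$ for $i \in \mathcal{E}$ and $c_i(x) \ge 0$ for $i \in \mathcal{I}$, with Lagrangian $\mathcal{L}(x, \lambda) = f(x) - \sum_{i} \lambda_i c_i(x)$. Under a suitable constraint qualification, a local minimizer $x^*$ admits multipliers $\lambda^*$ such that:

```latex
\begin{align*}
\nabla f(x^*) - \sum_{i \in \mathcal{E} \cup \mathcal{I}} \lambda_i^* \nabla c_i(x^*) &= 0, \\
c_i(x^*) &= 0, && i \in \mathcal{E}, \\
c_i(x^*) &\ge 0, && i \in \mathcal{I}, \\
\lambda_i^* &\ge 0, && i \in \mathcal{I}, \\
\lambda_i^* \, c_i(x^*) &= 0, && i \in \mathcal{I}.
\end{align*}
```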

Formalization of Algorithms for Optimization with Block Structures

Thu, 18 Jun 2026 00:00:00 +0000

Formalizes convergence of block coordinate descent and ADMM for block-structured optimization in Lean4, with subdifferentials and the KL property. Block-structured problems are central to advances in numerical optimization and machine learning. This paper provides the formalization of convergence analysis for two pivotal algorithms in such settings: the block coordinate descent (BCD) method and the alternating direction method of multipliers (ADMM). Utilizing the type-theory-based proof assistant Lean4, we develop a rigorous framework to formally represent these algorithms. Essential concepts in nonsmooth and nonconvex optimization are formalized, notably subdifferentials, which extend classical differentiability to handle nonsmooth scenarios, and the Kurdyka-Lojasiewicz (KL) property, which provides essential tools to analyze convergence in nonconvex settings. Such definitions and properties are crucial for the corresponding convergence analyses. We formalize the convergence proofs of these algorithms, demonstrating that our definitions and structures are coherent and robust. These formalizations lay a basis for analyzing the convergence of more general optimization algorithms.
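
As a reminder of what is being analyzed, the standard two-block ADMM iteration is sketched below for $\min_{x,z} f(x) + g(z)$ subject to $Ax + Bz = c$. The problem form, the penalty parameter $\rho > 0$, and the multiplier $y$ follow the usual textbook presentation and are assumptions here; the setting formalized in the paper may differ.

```latex
\begin{align*}
L_\rho(x, z, y) &= f(x) + g(z) + y^{\top}(Ax + Bz - c) + \tfrac{\rho}{2}\,\lVert Ax + Bz - c \rVert_2^2, \\
x^{k+1} &= \operatorname*{arg\,min}_{x} \; L_\rho(x, z^{k}, y^{k}), \\
z^{k+1} &= \operatorname*{arg\,min}_{z} \; L_\rho(x^{k+1}, z, y^{k}), \\
y^{k+1} &= y^{k} + \rho\,(Ax^{k+1} + Bz^{k+1} - c).
\end{align*}
```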

FANS -- Formal Answer Selection for Natural Language Math Reasoning Using Lean4

Jiarui Yao, Ruida Wang, Tong Zhang — Thu, 18 Jun 2026 00:00:00 +0000

FANS uses Lean4 to formally verify candidate answers for natural-language math reasoning and improve answer selection. Large Language Models (LLMs) have displayed astonishing abilities in various tasks, especially text generation, classification, and question answering. However, the reasoning ability of LLMs is still debated. The inherent ambiguity of Natural Language (NL) limits LLMs' ability to perform verifiable reasoning, so their answers can lack coherence and trustworthy support. To tackle these problems, we propose a novel framework named FANS: Formal ANswer Selection for Natural Language Math Reasoning Using Lean4. To the best of our knowledge, it is the first framework that utilizes Lean4 to enhance LLMs' NL math reasoning ability. In particular, given an NL math question and LLM-generated answers, FANS first translates the question and answers into Lean4 theorem statements. Then it tries to prove each statement using a Lean4 prover and verifies the proof with Lean4. Finally, it uses the formal-language (FL) result to assist in answer selection. It enhances LLMs' NL math ability by providing a computer-verifiable solution for the correct answer and offers an alternative method for answer selection beyond the reward model. Extensive experiments indicate the effectiveness of our framework. It can improve the accuracy of reward-model-enhanced LLMs on the MATH-500 dataset by up to 1.91% and on AMC-23 by up to 8.33% on strong reward-model baselines. In some particular fields where Lean4 excels, such as number theory, we can even select all correct solutions. The qualitative analysis also shows our framework can make NL results formally backed by Lean4 proofs. As a pioneering work in the corresponding field, we will open-source all our models and datasets to further boost the development of the field.

BFS-Prover: Scalable Best-First Tree Search for LLM-based Automatic Theorem Proving

Thu, 18 Jun 2026 00:00:00 +0000

BFS-Prover scales best-first tree search with expert iteration and DPO for LLM theorem proving in Lean4. Recent advancements in large language models (LLMs) have spurred growing interest in automatic theorem proving using Lean4, where effective tree search methods are crucial for navigating the underlying large proof search spaces. While the existing approaches primarily rely on value functions and/or Monte Carlo Tree Search (MCTS), the potential of simpler methods like Best-First Tree Search (BFS) remains underexplored. In this paper, we investigate whether BFS can achieve competitive performance in large-scale theorem proving tasks. We present BFS-Prover, a scalable expert iteration framework, featuring three key innovations. First, we implement strategic data filtering at each expert iteration round, excluding problems solvable via beam search node expansion to focus on harder cases. Second, we improve the sample efficiency of BFS through Direct Preference Optimization (DPO) applied to state-tactic pairs automatically annotated with compiler error feedback, refining the LLM's policy to prioritize productive expansions. Third, we employ length normalization in BFS to encourage exploration of deeper proof paths. BFS-Prover achieves a state-of-the-art score of $72.95\%$ on the MiniF2F test set and therefore challenges the perceived necessity of complex tree search methods, demonstrating that BFS can achieve competitive performance when properly scaled. To facilitate further research and development in this area, we have open-sourced our model at https://huggingface.co/ByteDance-Seed/BFS-Prover-V1-7B.
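
The search idea can be sketched in a few lines. The code below is a toy best-first search whose priority is the cumulative log-probability of a path divided by its length raised to a power `alpha`. That scoring rule, the parameter name `alpha`, and the `expand`/`is_goal` callbacks are assumptions for illustration, not the released implementation. Because log-probabilities are negative, dividing by a growing length makes long paths less heavily penalized, which is one way to encourage deeper exploration.

```python
import heapq
import itertools


def best_first_search(start, expand, is_goal, alpha=0.5, max_expansions=1000):
    """Toy best-first search with a length-normalized score.

    expand(state) returns an iterable of (tactic, next_state, log_prob).
    Returns the list of tactics leading to a goal state, or None.
    """
    tie = itertools.count()  # tie-breaker so states are never compared directly
    # heapq is a min-heap, so entries store the negated score.
    frontier = [(0.0, next(tie), start, [], 0.0)]
    seen = {start}
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, state, path, total = heapq.heappop(frontier)
        if is_goal(state):
            return path
        for tactic, nxt, log_prob in expand(state):
            if nxt in seen:  # simplification: keep only the first route to a state
                continue
            seen.add(nxt)
            new_total = total + log_prob
            new_path = path + [tactic]
            score = new_total / (len(new_path) ** alpha)
            heapq.heappush(frontier, (-score, next(tie), nxt, new_path, new_total))
    return None


# Hypothetical toy "proof graph": state -> [(tactic, next_state, log_prob)].
TOY = {
    "start": [("a", "s1", -0.1), ("b", "s2", -2.0)],
    "s1": [("c", "s3", -0.1)],
    "s3": [("d", "goal", -0.1)],
    "s2": [("e", "goal", -0.1)],
}

path = best_first_search("start", lambda s: TOY.get(s, []), lambda s: s == "goal")
print(path)  # ['a', 'c', 'd']
```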

From Informal to Formal -- Incorporating and Evaluating LLMs on Natural Language Requirements to Verifiable Formal Proofs

Thu, 18 Jun 2026 00:00:00 +0000

Builds 18k instruction-response pairs across five formal languages including Lean4, evaluating and fine-tuning LLMs on verifiable formal proofs. Research in AI-based formal mathematical reasoning has grown rapidly. These studies have excelled in mathematical competitions like IMO and have made significant progress. This paper focuses on formal verification, an immediate application scenario of formal reasoning, and breaks it down into sub-tasks. We constructed 18k high-quality instruction-response pairs across five formal specification languages (Coq, Lean4, Dafny, ACSL, and TLA+) by distilling gpt-4o and evaluated against ten open-sourced LLMs, including the recently popular DeepSeek-R1. We also fine-tuned several small 7-8B models to achieve performance comparable to DeepSeek-R1-671B. Interestingly, we observed that fine-tuning with formal data also enhances mathematics, reasoning, and coding capabilities. Fine-tuned models are released at https://huggingface.co/fm-universe.

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

Patrick Cooper, Alvaro Velasquez — Wed, 17 Jun 2026 00:00:00 +0000

The CONJURE track uses Lean 4/Mathlib as a kernel verifier for 560 transformative-creativity instances whose gold answers are new definitions. A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). 
Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

Exact 6-cut rigidity and small-order superconnectivity for the 6-regular case of Dirac's k=4 problem

Alper Ferudun — Wed, 17 Jun 2026 00:00:00 +0000

Machine-checks several supporting local-colouring lemmas in Lean 4/Mathlib for the 6-regular case of Dirac's k=4 problem. Dirac asked in 1970 whether for every k >= 4 there is a k-vertex-critical graph without critical edges; Jensen settled all k >= 5, and only k=4 remains open. Following Skottova and Steiner, call a graph G a (4,1)-graph if chi(G)=4, chi(G-v)=3 for every vertex v, and chi(G-e)=4 for every edge e; they proved delta(G) >= 6 and lambda(G) >= 6 for every (4,1)-graph and asked whether a 6-regular (4,1)-graph exists. We prove three results about this 6-regular case. Theorem A (computational): there is no 6-regular 4-vertex-critical graph on n <= 15 vertices, except for a unique graph (up to isomorphism) on n=13, whose 13 critical edges form a Hamilton cycle; hence any 6-regular (4,1)-graph has at least 16 vertices. Theorem B: in a 6-regular (4,1)-graph every 6-edge-cut is either the edge star of a vertex or has both shores of size at least 15; consequently every 6-regular (4,1)-graph on at most 29 vertices is super-6-edge-connected. Theorem C (all sizes): no shore of a nontrivial 6-edge-cut in a 6-regular (4,1)-graph induces a bipartite graph; more generally, a shore whose deficiency is concentrated on two vertices forces them to receive equal colours in every proper 3-colouring. The proof of Theorem B rests on an exact classification of the 3x3 cut matrices of 6-edge-cuts in (4,1)-graphs (exactly 21 matrices, five types up to row/column permutations) together with a boundary-shortfall lemma; the unique near-miss is K_{3,3,3} minus a rainbow 3-matching. Several supporting lemmas are machine-checked in Lean 4/Mathlib.
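
The definition of a (4,1)-graph can be checked by brute force on very small graphs. The sketch below is illustrative only: the function names are hypothetical, the search is exponential in the number of vertices, and it is not the method behind Theorem A. It confirms that $K_4$ is 4-chromatic and vertex-critical but fails the edge condition, since deleting any edge of $K_4$ lowers the chromatic number to 3.

```python
from itertools import combinations, product


def chromatic_number(vertices, edges):
    """Smallest k admitting a proper k-colouring (brute force, tiny graphs only)."""
    vertices = list(vertices)
    if not vertices:
        return 0
    for k in range(1, len(vertices) + 1):
        for colouring in product(range(k), repeat=len(vertices)):
            colour = dict(zip(vertices, colouring))
            if all(colour[u] != colour[v] for u, v in edges):
                return k
    return len(vertices)  # not reached: n colours always suffice


def is_41_graph(vertices, edges):
    """Check chi(G) = 4, chi(G - v) = 3 for every vertex, chi(G - e) = 4 for every edge."""
    vertices, edges = list(vertices), list(edges)
    if chromatic_number(vertices, edges) != 4:
        return False
    for v in vertices:
        rest = [u for u in vertices if u != v]
        rest_edges = [e for e in edges if v not in e]
        if chromatic_number(rest, rest_edges) != 3:
            return False
    for e in edges:
        if chromatic_number(vertices, [f for f in edges if f != e]) != 4:
            return False
    return True


k4_edges = list(combinations(range(4), 2))
print(chromatic_number(range(4), k4_edges))  # 4
print(is_41_graph(range(4), k4_edges))       # False: every edge of K4 is critical
```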

A Formalization of Austrian Economics. Praxeological Foundations: The Base System and Its Derived Theorems

Rafał Komendarczyk, Walter Block, John Levendis, Frank Tipler — Wed, 17 Jun 2026 00:00:00 +0000

A Lean companion encodes the praxeology axioms as type classes and constructs concrete models whose type-checking is a constructive consistency proof. This paper presents an axiomatization of Ludwig von Mises' praxeology in many-sorted first-order logic, isolating the foundational layer. We introduce a formal language with five sorts ({\sf Actors}, {\sf Actions}, {\sf Ends}, {\sf Things}, {\sf Times}) and six primitive relations ({\em Acts}, {\em Avail}, {\em EndOf}, {\em Use}, a preference order, and a time order), together with a base axiom system organised into three layers: the structure of action itself, the actor's preference order together with its revelation in choice, and material scarcity. The base system captures purposeful action in its bare praxeological form. Working entirely within the base system we derive the core classical Misesian propositions as Hilbert-style theorems: the asymmetry of revealed preference, the existence of opportunity cost, the structural scarcity of time, the subjectivity of opportunity cost, the law of diminishing marginal utility, and the increasing marginal disutility of labour. Where a theorem requires structure beyond the praxeological core -- as with diminishing marginal utility -- the additional premises are made explicit; identifying these hidden premises is one of the methodological payoffs of the approach. A self-contained {\em Lean} companion encodes the language as {\em Lean} type classes and constructs concrete models -- a three-period Robinson Crusoe economy and its infinite-time extension -- whose acceptance by the type-checker is a constructive consistency proof of the full base theory.

Visored: A Controlled-Natural-Language Prover for LLM-Generated Mathematics

Xiyu Zhai, Xinyi Chen, Yiping Wang, Runlong Zhou, Liao Zhang, Simon S. Du — Tue, 16 Jun 2026 00:00:00 +0000

A controlled-natural-language prover whose accepted proofs are re-emitted as checked Lean files, driven by an LLM on miniF2F. We present a dependent-type-based prover designed around the way LLMs (and humans) tend to write mathematics, complementing existing systems such as Lean and Rocq. Its core design choices are a surface that imitates mathematical natural language and a rule-driven automation layer that closes the routine steps a textbook would omit, so that an accepted proof can be re-emitted as a checked Lean file. Early experiments suggest that, even without any prover-specific training data, LLMs can learn to use it effectively on the miniF2F benchmark. Lean output excerpts: https://github.com/xiyuzhai-husky-lang/visored/