← All papers

Statistical Learning Theory in Lean 4: Empirical Processes from Scratch

Yuanhe Zhang, Jason D. Lee, Fanghui Liu

cs.LG Feb 2, 2026 · v1

formalization mathlib ml-theory

Read PDF arXiv abstract Code

TL;DR

First comprehensive Lean 4 formalization of statistical learning theory grounded in empirical process theory.

Abstract

We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implement the missing contents in latest Lean 4 Mathlib library, including a complete development of Gaussian Lipschitz concentration, the first formalization of Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, leading to the human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is available at https://github.com/YuanheZ/lean-stat-learning-theory

Problem

Statistical learning theory (SLT) results rely on empirical process theory with subtle measure-theoretic arguments, yet no comprehensive formalization existed in Lean 4 to verify these foundations rigorously.

Approach

The authors build the first comprehensive Lean 4 formalization of SLT grounded in empirical process theory. Their infrastructure fills gaps in Lean 4 Mathlib, including Gaussian Lipschitz concentration, the first formalization of Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. A human-AI collaborative workflow was used: humans design proof strategies while AI agents execute tactical proof construction.

Results

The formalization exposes and resolves implicit assumptions and missing details in standard SLT textbooks. It provides a reusable Lean 4 toolbox covering the chain from concentration inequalities through empirical process bounds to regression rates.

Papers With

Statistical Learning Theory in Lean 4: Empirical Processes from Scratch