← All papers

Structural Enforcement of Statistical Rigor in AI-Driven Discovery: A Functional Architecture

Karen Sargsyan

cs.SE Nov 10, 2025 · v1

formalization verification probability

Read PDF arXiv abstract Code

TL;DR

Provides a machine-checked Lean 4 formalization (855 lines, zero sorry) of the LORD++ online FDR-control theorem.

Abstract

AI-Scientist systems that use large language models to automate research risk generating spurious discoveries through uncontrolled multiple testing. We present a functional architecture that enforces statistical rigor at two levels: a Haskell embedded DSL (the Research monad) that makes it impossible to test a hypothesis without updating the error budget, and a declarative scaffolding technique that structurally prevents data leakage across the boundary into LLM-generated code. We ground these guarantees in a machine-checked Lean 4 formalization of the LORD++ online FDR control theorem (855 lines, zero sorry), which identifies four sufficient conditions for FDR control. Three are structural conditions -- about information flow, data separation, and test validity -- enforced by the architecture's type system and scaffolding. The fourth is an arithmetic condition: a budget invariant requiring that a wealth process remain non-negative under floating-point computation. We verify this condition over IEEE 754 doubles using SPARK/Ada, whose GNATprove toolchain statically confirms that no rounding sequence can violate the invariant and whose flow analysis independently confirms the predictability condition. The resulting verification chain -- from real-analysis proof to floating-point implementation -- is, to our knowledge, the first for any online FDR control procedure. Monte Carlo simulation (N=2000 hypotheses) and an end-to-end case study confirm that the monadic implementation controls FDR at 1.1% against a 5% target, while a naive approach inflates to 41%.

Problem

AI-Scientist systems that use LLMs to automate research risk generating spurious discoveries through uncontrolled multiple testing. Existing approaches lack structural enforcement of statistical rigor guarantees.

Approach

The authors present a functional architecture enforcing statistical rigor at two levels: a Haskell embedded DSL (the Research monad) that prevents hypothesis testing without updating the error budget, and a scaffolding technique that structurally prevents data leakage into LLM-generated code. These guarantees are grounded in a machine-checked Lean 4 formalization of the LORD++ online FDR control theorem (855 lines, zero sorry), which identifies four sufficient conditions for FDR control. The arithmetic budget invariant is verified over IEEE 754 doubles using SPARK/Ada.

Results

The resulting verification chain from real-analysis proof to floating-point implementation is the first for any online FDR control procedure. Monte Carlo simulation (N=2000 hypotheses) confirms that the monadic implementation achieves empirical FDR of 0.011 at target 0.05, compared to 0.409 for naive fixed-alpha testing.

Approach	Target FDR	Empirical FDR	Power
Naive (Fixed alpha=0.05)	0.05	0.409	0.640
Monadic (LORD++)	0.05	0.011	0.290

FDR control comparison

Papers With

Structural Enforcement of Statistical Rigor in AI-Driven Discovery: A Functional Architecture