Grain Theory: Type-Level Granularity Correctness in Data Pipelines
Nikos Karayannidis
cs.DB
Jan 2, 2026 · v1
TL;DR
Mechanically verifies grain theory's homomorphism and grain-lift composition theorems in Lean 4.
Abstract
Data transformation correctness is a fundamental challenge in data engineering: how can we verify that pipelines produce correct results before executing on production data? Existing practice relies on iterative testing over materialized data. A common cause of errors is the absence of formal reasoning about grain -- the level of detail of data -- so transformations inadvertently change granularity, yielding pathologies like fan traps (metric duplication) and chasm traps (data loss). We introduce grain theory, a type-theoretic framework that elevates grain to a composable property of any algebraic data type. It has two phases. First, a denotation of data: grain itself, defined by irreducibility and isomorphism, with no reference to functional dependencies; three grain relations forming a bounded lattice whose axioms recover Armstrong's on product types; the entity key as a derived grain; and grain-determined behavioral classes -- together the type-level triple (G[R], EK[R], BC[R]). Second, a denotation of transformations: every transformation $h$ has a grain lift $\varphi(h)$. For collections of product types under the relational algebra we prove an equi-join grain inference theorem and present CalcG, a decidable algorithm that composes grain lifts across a pipeline DAG. The central theorem -- the grain homomorphism -- ties the phases together: grain projection commutes with transformation, and grain lifts compose ($\varphi(h_2 \circ h_1) = \varphi(h_2) \circ \varphi(h_1)$). Grain-correctness is therefore verifiable at design time, before any code or query runs. As corollaries, fan traps emerge as schema-detectable grain-relation violations; chasm traps localize to a specific ordering-chain pattern; and behavioral-class violations, such as point-in-time queries on the wrong collection type, become compile-time type errors. All theorems are mechanically verified in Lean 4.
Problem
Data pipeline correctness is undermined by the absence of formal reasoning about grain (the level of detail of data), leading to fan traps (metric duplication) and chasm traps (data loss) that are only caught through iterative testing on materialized data.
Approach
The authors introduce grain theory, a type-theoretic framework that elevates grain to a composable property of algebraic data types. Grain is defined by irreducibility and isomorphism without reference to functional dependencies. Three grain relations form a bounded lattice whose axioms recover Armstrong's axioms on product types. The framework is formalized in Lean, enabling compile-time detection of granularity errors in data transformations.
Results
The framework provides type-level grain correctness guarantees for data pipelines, catching fan traps and chasm traps at compile time rather than through runtime testing. The Lean formalization establishes the theoretical foundations including lattice properties and Armstrong axiom recovery.