Semantic Identity Compression: Zero-Error Laws, Rate-Distortion, and Neurosymbolic Necessity
Tristan Simas
cs.IT
Jan 20, 2026 · v1
TL;DR
Machine-checks all main semantic-identity-compression laws in Lean 4.
Abstract
Symbolic systems operate over precise identities: variables denote specific objects, pointers target precise memory locations, and database keys refer to singular records. Neural embeddings generalize by compressing away semantic detail, but this compression creates collision ambiguity: multiple distinct entities can share the same representation value. Exact identity recovery requires additional information precisely when representation fibers have size greater than one. The residual cost is controlled by a single combinatorial object: the collision-fiber geometry of the representation map $π$. Let $A_π=\max_u |π^{-1}(u)|$ be the largest collision fiber. The finite laws include a tight fixed-length converse $L \ge \log_2 A_π$, an exact finite-block scaling law, a pointwise adaptive budget $\lceil \log_2 |π^{-1}(u)|\rceil$, and an exact fiberwise rate-distortion law for arbitrary finite sources via recoverable-mass decomposition across representation fibers. The uniform single-block formula $D^\star(L)=\max(0,1-2^L/a)$ appears as a closed-form special case when all mass lies on one collision block, where $a = A_π$ is the collision block size. The same fiber geometry determines query complexity and canonical structure for distinguishing families. Because this residual ambiguity is structural rather than representation-specific, symbolic identity mechanisms (handles, keys, pointers, nominal tags) are the necessary system-level complement to any non-injective semantic representation. All main results are machine-checked in Lean 4.
Problem
Neural embeddings compress semantic detail, creating collision ambiguity where multiple entities share the same representation. Quantifying the irreducible cost of recovering exact identity from such non-injective representations lacked a unified formal treatment.
Approach
The paper characterizes identity recovery cost through the collision-fiber geometry of the representation map pi. The largest fiber A_pi = max_u |pi^{-1}(u)| controls all bounds. Results include a tight fixed-length converse L >= log2(A_pi), finite-block scaling laws, pointwise adaptive budgets, and a fiberwise rate-distortion law via recoverable-mass decomposition. All main results are machine-checked in Lean 4.
Results
The framework proves that symbolic identity mechanisms (handles, keys, pointers) are the necessary complement to any non-injective semantic representation. The uniform single-block distortion formula D*(L) = max(0, 1 - 2^L/a) is derived as a closed-form special case. Query complexity and canonical distinguishing families are determined by fiber geometry.