← All papers
First page of Semantic Identity Compression: Zero-Error Laws, Rate-Distortion, and Neurosymbolic Necessity

Semantic Identity Compression: Zero-Error Laws, Rate-Distortion, and Neurosymbolic Necessity

Tristan Simas

cs.IT Jan 20, 2026 · v1
Machine-checks all main semantic-identity-compression laws in Lean 4.
Symbolic systems operate over precise identities: variables denote specific objects, pointers target precise memory locations, and database keys refer to singular records. Neural embeddings generalize by compressing away semantic detail, but this compression creates collision ambiguity: multiple distinct entities can share the same representation value. Exact identity recovery requires additional information precisely when representation fibers have size greater than one. The residual cost is controlled by a single combinatorial object: the collision-fiber geometry of the representation map $π$. Let $A_π=\max_u |π^{-1}(u)|$ be the largest collision fiber. The finite laws include a tight fixed-length converse $L \ge \log_2 A_π$, an exact finite-block scaling law, a pointwise adaptive budget $\lceil \log_2 |π^{-1}(u)|\rceil$, and an exact fiberwise rate-distortion law for arbitrary finite sources via recoverable-mass decomposition across representation fibers. The uniform single-block formula $D^\star(L)=\max(0,1-2^L/a)$ appears as a closed-form special case when all mass lies on one collision block, where $a = A_π$ is the collision block size. The same fiber geometry determines query complexity and canonical structure for distinguishing families. Because this residual ambiguity is structural rather than representation-specific, symbolic identity mechanisms (handles, keys, pointers, nominal tags) are the necessary system-level complement to any non-injective semantic representation. All main results are machine-checked in Lean 4.

Neural embeddings compress semantic detail, creating collision ambiguity where multiple entities share the same representation. Quantifying the irreducible cost of recovering exact identity from such non-injective representations lacked a unified formal treatment.

The paper characterizes identity recovery cost through the collision-fiber geometry of the representation map pi. The largest fiber A_pi = max_u |pi^{-1}(u)| controls all bounds. Results include a tight fixed-length converse L >= log2(A_pi), finite-block scaling laws, pointwise adaptive budgets, and a fiberwise rate-distortion law via recoverable-mass decomposition. All main results are machine-checked in Lean 4.

The framework proves that symbolic identity mechanisms (handles, keys, pointers) are the necessary complement to any non-injective semantic representation. The uniform single-block distortion formula D*(L) = max(0, 1 - 2^L/a) is derived as a closed-form special case. Query complexity and canonical distinguishing families are determined by fiber geometry.