BlockCert: Certified Blockwise Extraction of Transformer Mechanisms

Sandro Andric

cs.LG Nov 20, 2025 · v1

Read PDF arXiv abstract

TL;DR

Formalizes a Lipschitz-based composition theorem in Lean 4 that lifts per-block extraction error bounds to a global model deviation bound.

Abstract

Mechanistic interpretability aspires to reverse-engineer neural networks into explicit algorithms, while model editing seeks to modify specific behaviours without retraining. Both areas are typically evaluated with informal evidence and ad-hoc experiments, with few explicit guarantees about how far an extracted or edited model can drift from the original on relevant inputs. We introduce BlockCert, a framework for certified blockwise extraction of transformer mechanisms, and outline how a lightweight extension can support certified local edits. Given a pre-trained transformer and a prompt distribution, BlockCert extracts structured surrogate implementations for residual blocks together with machine-checkable certificates that bound approximation error, record coverage metrics, and hash the underlying artifacts. We formalize a simple Lipschitz-based composition theorem in Lean 4 that lifts these local guarantees to a global deviation bound. Empirically, we apply the framework to GPT-2 small, TinyLlama-1.1B-Chat, and Llama-3.2-3B. Across these models we obtain high per-block coverage and small residual errors on the evaluated prompts, and in the TinyLlama setting we show that a fully stitched model matches the baseline perplexity within approximately 6e-5 on stress prompts. Our results suggest that blockwise extraction with explicit certificates is feasible for real transformer language models and offers a practical bridge between mechanistic interpretability and formal reasoning about model behaviour.

Problem

Mechanistic interpretability and model editing lack explicit guarantees about how far an extracted or edited model can drift from the original on relevant inputs.

Approach

BlockCert extracts structured surrogate implementations for transformer residual blocks together with machine-checkable certificates bounding approximation error. A Lipschitz-based composition theorem formalized in Lean 4 lifts per-block local guarantees to a global deviation bound. The framework records coverage metrics and hashes underlying artifacts.

Results

Applied to GPT-2 small, TinyLlama-1.1B-Chat, and Llama-3.2-3B, BlockCert obtains high per-block coverage and small residual errors. A fully stitched TinyLlama model matches the baseline perplexity within approximately 6e-5 on stress prompts.

Papers With

BlockCert: Certified Blockwise Extraction of Transformer Mechanisms