Mechanistic Interpretability

Imagine you could move a tiny dial inside a large language model (say, increment the activation value of one hidden feature by 0.03) and watch the same model pause, ask a clarifying question, then emit a six-step chain of reasoning instead of a short guess. That is not fantasy: Kuznetsov et al. (2025) arxiv:2503.18878 demonstrated that adding a small offset to feature #16778 in a reasoning head causes the model to expand its output length and improve math benchmark scores by over 13%. What looks like a mysterious vector is actually a causal gear in the model’s thinking pipeline. Mechanistic interpretability is the craft of reverse-engineering those gears, turning the model from a black box into a debuggable, steerable program. By the end of this page you will understand what counts as a “mechanistic explanation,” how we extract manipulable features, how we assess their causal weight, and how to build the most compact artifact that lets you steer a concrete semantic feature inside a 0.5B-parameter model (Qwen-2.5-0.5B).

The territory

The problem mechanistic interpretability solves is simple in goal and brutal in detail: neural networks compute functions too amorphous for humans to reason about, yet we need to fix bugs, audit behavior, and build trustworthy agents. Instead of treating the model as an opaque mapping from tokens to tokens, mechanistic interpretability looks for discrete algorithms inside its high-dimensional activations. It takes inspiration from compiler reverse engineering, where a binary is probed with speculative inputs to recover loops, conditionals, and counters. Here the “binary” is the transformer, and the “probes” are probing classifiers, sparse autoencoders, activation patching, and gradient-based attribution.

Mechanistic interpretability sits between two adjacent families of techniques. It borrows the “feature attribution” vocabulary from explainability, asking “which part of the input mattered?”, but it pushes further by focusing on causal interventions inside the network. It borrows the “analysis-by-synthesis” framing of representation learning by building lightweight models (such as sparse autoencoders) whose latent coordinates align with interpretable phenomena, and it treats those coordinates as handles to change behavior. The goal is not to label individual neurons, but to recover small programs (“if-then” heuristics) composed of attention patterns, matrix multiplications, and gating structure. This is why mechanistic interpretability always asks: can we point to an internal representation, manipulate it, and explain the downstream behavioral change? How does it actually work inside a transformer? The mechanism is best understood by tracking a feature from identification through reconstruction, intervention, and attribution.

How it works

Mechanistic interpretability is anchored around three activities: identification, reconstruction, and intervention. Identification is the search for candidate activations or circuits that correlate with some semantic meaning. Reconstruction is the translation of that high-dimensional blob into a compact representation we can reason about. Intervention is the controlled perturbation that turns correlation into causation.

Identification via probes and circuits

Every transformer layer produces tens of thousands of activation coordinates. The first step is to reduce this to a manageable set of hypotheses. Probing classifiers grew out of the observation that a linear head trained on frozen activations can predict linguistic or factual labels. In mechanistic work we do the opposite: we search for activations that already encode the label without retraining. Circuit analysis takes this further by tracing attention flows and residual contributions. For a given hypothesis (say, “the model tracks the parity of a list”), we look for the minimal subgraph—attention heads, MLP neurons, residual stream slots—whose removal or patching alters the output.

This is still only correlation, so researchers refined the search. Activation patching runs a forward pass on two inputs and, mid-inference, replaces the activations for one input with those from another. If replacing the residual stream at layer ℓ with the version from the “correct reasoning” input fixes the “incorrect” input, then the patch signal reveals which vectors carry the desired information. Kuznetsov et al. (2025) used feature-wise patching to isolate feature #16778 as a signal for “reflective reasoning”: patching the feature from a strongly reasoned sample into a weaker one caused the latter to emit longer, more accurate responses. Because the patch happened in the middle of inference, and the same patch on different prompts produced the same behavior, the feature qualified as a “causal gear” rather than a dataset artifact.
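The patching logic described above can be sketched on a toy network. This is a minimal illustration, not the procedure from Kuznetsov et al.: the weights `W1` and `W2`, the `forward` function, and the `recovery` score are all hypothetical names chosen for the example. Each hidden unit from a "clean" run is copied into a "corrupted" run, and the score is the fraction of the output gap that the patch restores.

```python
def relu(v):
    return [max(0.0, t) for t in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


# Hypothetical toy weights: 3 inputs -> 3 hidden units -> 1 output.
W1 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0]]
W2 = [2.0, 0.5, 0.0]  # hidden unit 2 has no path to the output


def forward(x, patch=None):
    """Run the toy model; `patch` maps a hidden index to a value written mid-pass."""
    h = relu(matvec(W1, x))
    if patch:
        for idx, value in patch.items():
            h[idx] = value
    out = sum(w * hi for w, hi in zip(W2, h))
    return out, h


clean_x = [1.0, 1.0, 1.0]
corrupt_x = [0.0, 1.0, 0.0]

clean_out, clean_h = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)


def recovery(idx):
    """Fraction of the clean-vs-corrupted output gap restored by patching one unit."""
    patched_out, _ = forward(corrupt_x, patch={idx: clean_h[idx]})
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)


scores = [recovery(i) for i in range(3)]
print(scores)
```

Hidden units 0 and 2 both differ between the two runs, but only unit 0 restores the output when patched. Unit 2 correlates with the clean input yet has no causal effect, which is the distinction patching is meant to expose.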

Sparse autoencoders as compact interfaces

Once a candidate activation has been found, it is still inscrutable: a vector in ℝ^d with no semantics. Sparse autoencoders (SAEs) are the reconstruction tool that turns these vectors into interpretable coordinates. Given a dataset of activations \(A \in \mathbb{R}^{n \times d}\) collected from clean forward passes, we fit an encoder \(f_\phi: \mathbb{R}^d \to \mathbb{R}^k\) and decoder \(g_\psi: \mathbb{R}^k \to \mathbb{R}^d\) with a sparsity penalty. The SAE objective is

\[ \mathcal{L}(\phi, \psi) = \frac{1}{n}\sum_{i=1}^n \Big( \|g_\psi(f_\phi(a_i)) - a_i\|_2^2 + \lambda \, \|f_\phi(a_i)\|_1 \Big), \]

where \(a_i\) is the activation vector, \(f_\phi(a_i)\) is its sparse code, and the \(\ell_1\) term (equivalent, up to constants, to the negative log-density of a zero-mean Laplace prior on the code) encourages most coordinates to be near zero, letting only a few features fire per activation. The decoder reconstructs the original activations, so low reconstruction error suggests that the sparse codes retain most of the information the activation carries; whether they retain its causal role still has to be tested by intervention. If one SAE coordinate activates whenever the model performs “SQL parsing” on a prompt, and adding that coordinate to a different activation flips the model’s behavior, then that coordinate inherits the semantics of “SQL parsing.” Kuznetsov et al. (2025) showed that a coordinate we call the “reflection feature” had this property: it fired before reasoning steps, and amplifying it in post-activation patching lengthened and improved the reasoning trace by 13% on MATH.
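As a concrete reading of a reconstruction-plus-sparsity objective, here is a minimal sketch that evaluates the loss for a one-layer ReLU encoder and a linear decoder, using an L1 penalty on the code (the same form the build recipe uses). The weight names `W_enc`, `b_enc`, `W_dec` and the tiny dimensions are assumptions made for illustration; a real SAE would be trained with an autodiff framework.

```python
def relu(v):
    return [max(0.0, t) for t in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def encode(a, W_enc, b_enc):
    """Sparse code c = ReLU(W_enc a + b_enc); W_enc has shape k x d."""
    return relu([s + b for s, b in zip(matvec(W_enc, a), b_enc)])


def decode(c, W_dec):
    """Linear reconstruction; W_dec has shape d x k."""
    return matvec(W_dec, c)


def sae_loss(acts, W_enc, b_enc, W_dec, lam):
    """Mean over activations of squared reconstruction error plus lam * L1(code)."""
    total = 0.0
    for a in acts:
        c = encode(a, W_enc, b_enc)
        a_hat = decode(c, W_dec)
        recon = sum((x - y) ** 2 for x, y in zip(a_hat, a))
        sparsity = sum(abs(cj) for cj in c)
        total += recon + lam * sparsity
    return total / len(acts)


# Hypothetical toy setup: d = 2 activation dims, k = 3 code dims.
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -10.0]  # large negative bias keeps coordinate 2 silent
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
acts = [[1.0, 0.0], [0.0, 2.0]]

loss = sae_loss(acts, W_enc, b_enc, W_dec, lam=0.1)
codes = [encode(a, W_enc, b_enc) for a in acts]
print(loss, codes)
```

In this toy case reconstruction is perfect, so the whole loss comes from the sparsity term, and each activation turns on exactly one coordinate.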

SAE training thus compresses the manifold of activations into interpretable axes. Because only a few coordinates fire on any given activation, we can inspect weights and gradients at human scale. The decoder reveals which directions in the original activation space a coordinate writes to, and the encoder tells us when the feature is triggered.

MIB: benchmarking causal and circuit localization

Identifying candidate features requires rigorous evaluation. Mueller et al. (2025) “MIB: A Mechanistic Interpretability Benchmark” supplies that evaluation: it defines tasks such as “is feature X localized to this neuron or patch?”, “does feature X cause label Y when patched?”, and “how much of the behavior is captured by the sparse code vs. raw neuron?” The benchmark frames each question as a causal intervention and quantifies the drop in accuracy when the feature is ablated, and the recovery when it is patched. The key insight is that a feature can score high on correlation but still fail the causality test if its support intersects with irrelevant directions. MIB standardized this by requiring: (1) a reference dataset, (2) a causal probe that modifies activations, and (3) counterfactual validation that the intervention generalizes to new prompts. They compared raw neurons, attention head averages, and SAE coordinates across dozens of tasks. The verdict: SAEs often explain more behavior per feature, but in simple localization tasks (copying digits, isolated token detection) the raw neurons sometimes outperform SAEs because the sparsity constraint spreads the signal too thin. That caution keeps mechanistic interpretability honest—successful reconstruction does not imply universal semantics.
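The two quantities mentioned above, the drop under ablation and the recovery under patching, reduce to simple arithmetic once the accuracies are measured. The sketch below is an illustration only: the function names, the normalization of recovery by the clean-vs-corrupted gap, and all numbers are assumptions for the example, not MIB's official metrics or results.

```python
def ablation_drop(clean_acc, ablated_acc):
    """Accuracy lost when the candidate feature is ablated."""
    return clean_acc - ablated_acc


def patch_recovery(clean_acc, corrupted_acc, patched_acc):
    """Fraction of the clean-vs-corrupted accuracy gap restored by patching."""
    gap = clean_acc - corrupted_acc
    if gap == 0:
        raise ValueError("clean and corrupted runs score the same; nothing to recover")
    return (patched_acc - corrupted_acc) / gap


# Made-up accuracies for two hypothetical candidates, for illustration only.
candidates = {
    "candidate_a": {"clean": 0.90, "ablated": 0.85, "corrupted": 0.50, "patched": 0.60},
    "candidate_b": {"clean": 0.90, "ablated": 0.60, "corrupted": 0.50, "patched": 0.80},
}

report = {
    name: (
        ablation_drop(m["clean"], m["ablated"]),
        patch_recovery(m["clean"], m["corrupted"], m["patched"]),
    )
    for name, m in candidates.items()
}
for name, (drop, rec) in report.items():
    print(f"{name}: ablation drop = {drop:.2f}, patch recovery = {rec:.2f}")
```

A candidate that shows both a large ablation drop and a high patch recovery is necessary and sufficient for the behavior in this narrow sense; a candidate that only correlates with the label will tend to score low on both.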

GrAInS and gradient-based attribution

Distance from the training set is the final barrier. Static features are useful, but real applications demand inference-time control: how do we nudge the model on the fly without retraining? GrAInS (Gradient-based Attribution for Interpretability in Sparse codes) arxiv:2502.04217 moves beyond static steering by computing Integrated Gradients on the SAE coordinates themselves. Given an input \(x\) and an SAE coordinate \(c_j\), GrAInS computes

\[ \text{IG}(x; c_j) = (x - x') \times \int_{\alpha=0}^1 \frac{\partial c_j(f_\phi(x' + \alpha(x - x')))}{\partial x} \, d\alpha, \]

where \(x'\) is a baseline prompt, \(f_\phi\) is the encoder, and \(c_j\) is the \(j\)-th coordinate. This integral captures how sensitive the coordinate is to each token. Because the encoder is differentiable, GrAInS flows gradients through the SAE into the embedding layer, creating token-level saliency that respects the feature’s causal footprint. GrAInS then uses these gradients to selectively override the target tokens through gradient descent: we add a small perturbation \(\Delta x\) to the embedding so that the coordinate increases or decreases in a desired direction. The perturbation is scaled to keep the resulting logit shift within the model’s activation range.
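The integral above is usually approximated with a Riemann sum along the straight path from the baseline to the input. The following is a minimal sketch under strong simplifications: the "feature" is a hypothetical scalar function `tanh(W · x)` of a three-dimensional embedding with an analytic gradient, standing in for an SAE coordinate composed with the model. The names `W`, `feature`, and `integrated_gradients` are illustrative and are not taken from the GrAInS code.

```python
import math

W = [0.5, -1.0, 2.0]  # hypothetical weights standing in for one feature's read-in direction


def feature(x):
    """Toy stand-in for an SAE coordinate as a function of the embedding x."""
    return math.tanh(sum(w * xi for w, xi in zip(W, x)))


def feature_grad(x):
    """Analytic gradient of tanh(W . x) with respect to x."""
    t = feature(x)
    return [(1.0 - t * t) * w for w in W]


def integrated_gradients(x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    avg_grad = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = feature_grad(point)
        avg_grad = [acc + g / steps for acc, g in zip(avg_grad, grad)]
    # Elementwise product of (x - baseline) with the path-averaged gradient.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


x = [1.0, 0.5, 0.25]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

# Completeness check: attributions should sum to feature(x) - feature(baseline).
completeness_gap = abs(sum(attributions) - (feature(x) - feature(baseline)))
print(attributions, completeness_gap)
```

The completeness check is a useful sanity test in practice: if the attributions do not sum to the change in the feature, the number of integration steps is probably too small.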

With this pipeline, we achieve two things simultaneously: (1) we produce an attribution map that explains which tokens triggered the causal feature, and (2) we obtain an inference-time steering signal by adjusting the tokens in the direction the feature wants. GrAInS also generalizes to vision-language models by treating the baseline as an image embedding and backpropagating through the image encoder. Without fine-tuning the backbone, we can send the feature counterfactual values. The success of GrAInS shows mechanistic interpretability is not a post-hoc lab report; it is a live knob we can turn while the model is running.
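The steering half of the pipeline can be illustrated with a single normalized gradient step on the embedding. This is a sketch under assumptions, not the GrAInS update rule: the toy `feature` function, the unit-norm step, and the `scale` and `direction` parameters are hypothetical choices that only demonstrate how a bounded perturbation moves a differentiable feature up or down.

```python
import math

W = [0.5, -1.0, 2.0]  # hypothetical weights for a toy feature


def feature(x):
    return math.tanh(sum(w * xi for w, xi in zip(W, x)))


def feature_grad(x):
    t = feature(x)
    return [(1.0 - t * t) * w for w in W]


def steer(x, scale=0.05, direction=1):
    """Move x by a step of L2 norm `scale` along (or against) the feature gradient."""
    grad = feature_grad(x)
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(x)  # flat region: no informative direction
    return [xi + direction * scale * g / norm for xi, g in zip(x, grad)]


x = [0.2, 0.1, 0.0]
up = steer(x, direction=1)
down = steer(x, direction=-1)
step_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(up, x)))
print(feature(down), feature(x), feature(up), step_norm)
```

Normalizing the step keeps the perturbation size independent of the gradient magnitude, which is one simple way to bound how far the embedding drifts from its original value.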

Failure modes and robustness

Mechanistic interpretability can be brittle. If an SAE feature depends on a narrow distribution of prompts, patching it in a new context may fail. Activation patching may also co-opt unrelated residual stream dimensions; if we only patch the aggregate vector without understanding the downstream weights, we might trigger spurious behavior. To mitigate this, the community uses multi-prompt evaluation—if the feature handles both arithmetic and logic prompts the same way, it likely represents an algorithm, not a dataset artifact—and gradient-based attribution to check token-level consistency. MIB’s benchmark ensures every candidate must survive ablations across splits. That’s why the mechanistic workflow is identification → reconstruction → causal validation: you do not claim an explanation until it passes all three gates.
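One way to make the multi-prompt check concrete is to compare the mean effect of an intervention across prompt families and require the same sign and a minimum size in each. The sketch below is only an illustration: the function name, the `min_effect` threshold, and all effect sizes are hypothetical, and a real evaluation would add held-out splits and significance testing.

```python
def consistent_across_families(effects_by_family, min_effect=0.05):
    """Check that the mean intervention effect has one sign and enough size in every family.

    effects_by_family maps a prompt-family name to a non-empty list of per-prompt effects.
    """
    means = {fam: sum(vals) / len(vals) for fam, vals in effects_by_family.items()}
    same_sign = all(m > 0 for m in means.values()) or all(m < 0 for m in means.values())
    strong = all(abs(m) >= min_effect for m in means.values())
    return same_sign and strong, means


# Made-up per-prompt accuracy changes after patching a feature, for illustration only.
effects = {
    "arithmetic": [0.12, 0.09, 0.15],
    "logic": [0.08, 0.11, 0.10],
}
ok, means = consistent_across_families(effects)

# Adding a family where the effect vanishes should make the check fail.
effects_with_outlier = dict(effects, sql=[-0.02, 0.01, 0.0])
ok_outlier, _ = consistent_across_families(effects_with_outlier)
print(ok, ok_outlier)
```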

Where the field is now

Mechanistic interpretability has moved from ad-hoc demonstrations to structured evaluation. Kuznetsov et al. (2025) anchor the research frontier by logging activations in Qwen-2.5-0.5B, training SAEs on reasoning traces, and steering features to boost MATH accuracy by 13%. Their work also introduced “reflection feature” diagnostics and showed that injecting the same sparse code at different depth levels produced consistent behavior, which means the feature is not tied to a single neuron but to a circuit. MIB (Mueller et al. 2025) followed by publishing a benchmark that compares neurons, attention heads, and SAE coordinates on the same tasks, providing a rigorous dashboard of localization vs. causal power. The benchmark’s dataset, evaluation scripts, and visualization toolkit are open source so that every new explainability method can be evaluated in the same causal framework. GrAInS (2025) is the latest advance: by combining SAE reconstructions with Integrated Gradients, it accomplishes token-level attribution and inference-time steering without fine-tuning, and it extends naturally to VLMs by treating the encoder gradient as a “token” in continuous space.

The engineering frontier is where systems integrate these insights. OpenAI’s activation patching blog from 2022 already demonstrated a minimal proof that patching internal activations can correct hallucinations; the current frontier builds on that with SAEs and gradient attribution. The production challenge is deployment: you need low-latency instrumentation to collect activations, an SAE that can encode and decode quickly, and a control loop that applies GrAInS-style perturbations within the inference budget. Industry teams are building interpretability control planes that log activations, run lightweight SAEs, and revert interventions when they conflict with alignment constraints. The engineering takeaway is clear: interpretability today is not a report, it is an orchestration of tracing, reconstruction, and intervention pipelines bound to the inference cycle.

What's still open

Can we formalize the semantic stability of an SAE feature? Right now we argue “monosemantic” when the feature responds consistently across prompts, but the criterion is qualitative. A mathematical guarantee would specify a distribution \(\mathcal{D}\) over inputs such that the feature’s activation is both necessary and sufficient for a concept, perhaps by bounding the mutual information between the feature and a downstream classifier’s output.

How can we separate context-dependent heuristics from universal concepts inside the encoder? SAEs may learn features that only look interpretable under human inspection but fail to generalize beyond narrow linguistic frames. Is there an algorithmic test, similar to MIB’s causal ablation, that certifies a feature’s universality across domains without enumerating every prompt?

What is the trade-off between interpretability and spec alignment at the inference loop? GrAInS injects perturbations based on gradients, but those perturbations might amplify undesired behaviors if the feature is entangled with overlapping circuits. Can we design a verification step that rejects steering signals whose causal downstream paths cross into misaligned objectives?

Where to read next

The circuits view is fleshed out in [[circuits-of-ml]], which explains how attention heads and residual contributions combine into higher-level routines, and the causal validation layer is deepened in [[activation-patching]], which shows how patching experiments connect to downstream behavior. If you want the optimization story behind these representations, [[representation-learning]] traces how sparse codes arise from autoencoders, and for the next interpretability paradigm, [[self-supervised-causality]] sketches how causal abstraction might be learned rather than manually engineered.

Build it

This build proves that a mechanistic interface can detect, reconstruct, and steer a semantic feature in a mid-sized model without retraining the backbone. You will train a sparse autoencoder on TinySQL derivations collected from Qwen-2.5-0.5B’s residual stream, pinpoint the coordinate that lights up for SQL-like prompts, and steer it to improve accuracy on held-out questions.

What you're building: A TinySQL sparse autoencoder that reconstructs SQL syntax activations in Qwen-2.5-0.5B, plus the patching and GrAInS-style steering loop that manipulates SQL planning depth.

Why this is valuable: It distills one causal feature out of the model into a compact artifact and exercises the entire mechanistic pipeline: identify candidate activations, compress them into a sparse code, evaluate causality, and control behavior.

Stack:
- Model: Qwen/Qwen2.5-0.5B (~250M downloads)
- Dataset: allenai/spider filtered to 500 SQL examples for TinySQL
- Framework: transformers==4.40.0, datasets==2.9.0, accelerate==0.24.0, bitsandbytes==0.39.0
- Compute: Free Colab T4 (16GB VRAM) or local RTX 4070, ~3 hours for data prep and SAE training

The recipe:
1. Install the stack (pip install transformers==4.40.0 datasets==2.9.0 accelerate==0.24.0 bitsandbytes==0.39.0 torch==2.1.0). Load the Qwen model with torch_dtype=torch.float16 and disable gradients on the backbone.
2. Load allenai/spider, sample 500 (TinySQL) examples, and format each prompt/SQL pair. Run Qwen forward passes, capture activations at the chosen layer (e.g., transformer block 18 residual stream), and normalize them. Store them as n × d arrays.
3. Train an SAE: encoder layers [d → 1024 → 256 → k] with ReLU, decoder mirroring the structure, and L1 sparsity on the code of size \(k=64\). Use the loss \(\mathcal{L} = \|g(f(a)) - a\|^2 + \lambda \sum_j |c_j|\) with \(\lambda = 10^{-3}\). Expect reconstruction loss to plateau around 0.01 and sparse code density <10%.
4. Evaluate causality: probe each SAE coordinate by patching its decoder output into random SQL prompts and measure execution accuracy. The coordinate with the highest delta becomes the “SQL syntax feature.”
5. Steer the feature with GrAInS-style gradients: compute integrated gradients on the SAE encoder to identify tokens responsible for the activation, and apply small embedding perturbations that increase the feature (scale 0.05). Re-run the model and measure the improvement in SQL plan accuracy.
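Step 2 of the recipe, capturing and normalizing residual-stream activations, could look roughly like the sketch below. Several choices here are assumptions rather than part of the recipe: the Hub id `Qwen/Qwen2.5-0.5B`, taking only the last-token vector of each prompt, unit-norm scaling as the normalization, and the helper names `capture_residual` and `normalize_rows`. The heavy imports are kept inside the function so that the normalization helper can be used without a GPU or a model download.

```python
def normalize_rows(rows, eps=1e-8):
    """Scale each activation vector to unit L2 norm (one possible normalization)."""
    out = []
    for row in rows:
        norm = sum(v * v for v in row) ** 0.5
        out.append([v / (norm + eps) for v in row])
    return out


def capture_residual(prompts, layer=18, model_name="Qwen/Qwen2.5-0.5B"):
    """Return one normalized residual-stream vector (last token) per prompt."""
    # Local imports: torch and transformers are only needed when this is called.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype).to(device)
    model.eval()  # inference only; the backbone is never updated

    rows = []
    with torch.no_grad():  # no gradients needed for activation capture
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(device)
            outputs = model(**inputs, output_hidden_states=True)
            # hidden_states[0] is the embedding output; index `layer` is the
            # residual stream after transformer block `layer`.
            vec = outputs.hidden_states[layer][0, -1]
            rows.append(vec.float().cpu().tolist())
    return normalize_rows(rows)


# The normalization helper can be checked without loading any model.
unit = normalize_rows([[3.0, 4.0]])
print(unit)
```

Calling `capture_residual(["SELECT name FROM users WHERE id = 1"])` would return a list with one vector of the model's hidden size; stacking those vectors over the dataset gives the n × d array the SAE is trained on.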

Expected outcome: A checkpointed SAE, a patching script that demonstrates causal carryover, and a steering pipeline that increases SQL accuracy by ~7% on held-out TinySQL prompts.

CS student: Run the same recipe on TinySQL’s 200-example subset, reduce the SAE to \(k=32\), and record activation visualizations in Colab so the feature semantics can be shared via notebook cells.
Applied engineer: Wrap the steering loop into a vLLM-based service, quantize the SAE with 4-bit weights, and expose a latency-monitored endpoint where each request triggers feature patching with p50 < 800ms on an NVIDIA A10.
Applied researcher: Ablate the sparsity hyperparameter (\(\lambda \in \{10^{-4}, 10^{-3}, 10^{-2}\}\)) to test whether sparser codes have higher causal precision while monitoring whether patching introduces distribution shift.
Frontier researcher: Use the SAE patching artifact to test the open question “does the SQL feature remain monosemantic across domains?” by transferring patching to a different dataset (e.g., CoSQL) and checking whether causality still holds, falsifying the monosemanticity claim if performance drops.
