Mechanistic Interpretability — AI glossary

Mechanistic interpretability is the research approach of reverse-engineering neural networks into human-understandable algorithms by identifying the specific circuits, features, or attention heads that implement particular behaviors. Unlike behavioral interpretability (which only tests inputs and outputs), mechanistic interpretability aims to trace how a model actually computes: for example, which neurons activate for "Harry Potter" or how a model tracks subject-verb agreement. For operators running models locally, this matters because local models are often smaller and more amenable to circuit analysis, and understanding these internals can help debug unexpected outputs or check safety properties without relying solely on black-box testing.

Deeper dive

Mechanistic interpretability draws on techniques like activation patching, probing, and sparse autoencoders to locate and isolate computational subgraphs. A classic example is the IOI (Indirect Object Identification) circuit in GPT-2 Small, where specific attention heads attend to names earlier in the sentence and copy the right one forward to predict the correct indirect object. Operators running local models can use a library like TransformerLens to inspect their own models, or browse published features for supported models on Neuronpedia. The field is still nascent: most detailed circuit studies focus on small models, often well under 7B parameters, because larger models have too many components to map exhaustively. For local AI, this means that interpretability findings from open-weight models (e.g., Llama 3 8B) can be checked directly against the same weights running on your hardware, unlike proprietary models where weights are hidden.
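To make the probing idea concrete, here is a minimal sketch of a difference-of-means linear probe in plain Python. It is an illustration under loud assumptions: the activation vectors and labels are synthetic (no real model is involved), and the helper names fit_mean_diff_probe and probe_predict are made up for this example.

```python
# Toy linear probe: checks whether a feature is linearly readable from
# activation vectors. All data here is synthetic, not taken from a model.

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fit_mean_diff_probe(activations, labels):
    """Return (direction, threshold) for a difference-of-means probe."""
    pos = [a for a, y in zip(activations, labels) if y == 1]
    neg = [a for a, y in zip(activations, labels) if y == 0]
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    # Threshold sits halfway between the class means along the direction.
    threshold = (dot(mu_pos, direction) + dot(mu_neg, direction)) / 2
    return direction, threshold

def probe_predict(activation, direction, threshold):
    return 1 if dot(activation, direction) > threshold else 0

# Synthetic "activations": dimension 0 carries the feature, dimension 1 is noise.
acts = [[2.0, 0.5], [1.5, -0.5], [2.5, 0.0],
        [-2.0, 0.5], [-1.5, -0.5], [-2.5, 0.0]]
labels = [1, 1, 1, 0, 0, 0]

direction, threshold = fit_mean_diff_probe(acts, labels)
predictions = [probe_predict(a, direction, threshold) for a in acts]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

With real activations, a probe like this would be evaluated on held-out examples. High probe accuracy shows the information is present in the activations, not that the model actually uses it, which is why probing is usually paired with causal methods such as patching.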

Practical example

Consider Llama 3.1 8B on an RTX 4090. Ollama serves the model for inference but does not expose internal activations, so for this kind of analysis you would load the Hugging Face weights into TransformerLens instead. You could then apply activation patching to find which attention heads handle the task "The capital of France is" → "Paris." By replacing the activation of one head at a time with its value from a corrupted prompt and measuring the drop in the probability of "Paris," you identify the heads that are candidates for a factual-recall circuit. This is practical because it might tell you, hypothetically, that a subnetwork of about 12 heads carries most of the effect for that fact, and if the model mispredicts, you can check whether that circuit is behaving differently.
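The patching logic described above can be sketched on a toy stand-in before touching a real model. Everything here is an assumption made for illustration: the "model" is four hand-written "heads" whose outputs are combined by fixed weights, and the scalar inputs 1.0 and 0.0 stand in for the clean and corrupted prompts. The sketch patches clean activations into the corrupted run, one head at a time, and reports how much of the clean-versus-corrupted gap each head restores.

```python
# Toy activation patching. The "model" is a hand-written stand-in, not a
# transformer: each "head" maps the input to one number, and the logit for
# the correct answer is a weighted sum of the head outputs.

HEAD_WEIGHTS = [0.0, 3.0, 0.5, 2.0]  # hypothetical readout weights

def head_outputs(x):
    """Activation of each toy head for a scalar input x."""
    # Head 3 ignores the input, so it is identical in both runs.
    return [x * 1.0, x * 2.0, x * 1.0, 5.0]

def logit_from_heads(acts):
    return sum(w * a for w, a in zip(HEAD_WEIGHTS, acts))

def patch_head(clean_acts, corrupt_acts, head):
    """Corrupted run with one head's activation replaced by its clean value."""
    patched = list(corrupt_acts)
    patched[head] = clean_acts[head]
    return patched

clean_acts = head_outputs(1.0)    # stands in for the clean prompt
corrupt_acts = head_outputs(0.0)  # stands in for the corrupted prompt

clean_logit = logit_from_heads(clean_acts)
corrupt_logit = logit_from_heads(corrupt_acts)
gap = clean_logit - corrupt_logit

# Fraction of the clean-vs-corrupted gap restored by patching each head.
recovered = []
for h in range(len(HEAD_WEIGHTS)):
    patched_logit = logit_from_heads(patch_head(clean_acts, corrupt_acts, h))
    recovered.append((patched_logit - corrupt_logit) / gap)

# Heads ordered from most to least important for this prompt pair.
ranking = sorted(range(len(recovered)), key=lambda h: recovered[h], reverse=True)
```

Two details carry over to real models. Head 0 changes with the input but has zero readout weight, and head 3 has a large weight but does not change between prompts, so both score zero: patching measures causal relevance for this particular contrast, not raw activation size. In a real transformer the heads interact nonlinearly, so the recovered fractions will not sum neatly to one as they do here.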

Workflow example

In practice, an operator might install TransformerLens (or clone its repository), load a local model (e.g., model = HookedTransformer.from_pretrained("meta-llama/Llama-3.1-8B")), and run a circuit discovery notebook. They would define a clean prompt and run a forward pass to cache its activations, then rerun the model on a corrupted prompt while patching in activations from the clean run, one component at a time, and measure how much each patch restores the logit difference between the correct and incorrect answers. The output is typically a heatmap over layers and attention heads showing each head's importance. Once the weights have been downloaded from Hugging Face, this workflow can run entirely offline.
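The workflow above might look roughly like the following sketch. It is not a definitive implementation: the function names patch_heads and rank_heads are made up for this example, and the choice to patch the per-head attention output (the "z" activation) is one option among several. The TransformerLens calls used (from_pretrained, to_tokens, to_single_token, run_with_cache, run_with_hooks, utils.get_act_name) are imported inside the function, so the pure-Python ranking helper can be used without the library installed.

```python
def rank_heads(scores):
    """Flatten a [layer][head] grid into (layer, head, score) tuples,
    sorted from highest to lowest score."""
    flat = [(layer, head, s)
            for layer, row in enumerate(scores)
            for head, s in enumerate(row)]
    return sorted(flat, key=lambda t: t[2], reverse=True)

def patch_heads(model_name, clean_prompt, corrupt_prompt, answer, wrong_answer):
    """Patch each head's clean output into the corrupted run and return a
    [layer][head] grid of logit differences (answer minus wrong_answer)."""
    import torch
    from transformer_lens import HookedTransformer, utils

    model = HookedTransformer.from_pretrained(model_name)
    clean = model.to_tokens(clean_prompt)
    corrupt = model.to_tokens(corrupt_prompt)
    # Position-wise patching only makes sense if the prompts line up.
    assert clean.shape == corrupt.shape, "prompts must tokenize to the same length"
    right = model.to_single_token(answer)        # raises if not a single token
    wrong = model.to_single_token(wrong_answer)

    scores = []
    with torch.no_grad():
        _, clean_cache = model.run_with_cache(clean)
        for layer in range(model.cfg.n_layers):
            name = utils.get_act_name("z", layer)  # per-head attention output
            row = []
            for head in range(model.cfg.n_heads):
                def patch_one_head(z, hook, head=head, name=name):
                    # z has shape [batch, pos, head, d_head]; overwrite one head.
                    z[:, :, head, :] = clean_cache[name][:, :, head, :]
                    return z
                logits = model.run_with_hooks(
                    corrupt, fwd_hooks=[(name, patch_one_head)])
                last = logits[0, -1]  # logits at the final position
                row.append((last[right] - last[wrong]).item())
            scores.append(row)
    return scores
```

A call such as rank_heads(patch_heads("gpt2", "The capital of France is", "The capital of Italy is", " Paris", " Rome")) would list heads by how strongly patching them pushes the corrupted run back toward " Paris"; the prompts and the small "gpt2" model are illustrative choices, and the grid can be passed to any plotting library to produce the heatmap. The loop runs one forward pass per head, so an 8B model takes considerably longer than GPT-2 Small.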