11. Activation Patching
Activation patching (also known as causal mediation analysis or path patching) measures the causal effect of specific model components on outputs. By patching activations from a corrupted run into a clean run, you isolate which circuits carry harmful information.
Mechanics of Patching
def activation_patch_experiment(
clean_tokens, corrupted_tokens, model, layer_idx, position_idx
):
"""Patch activations at a specific layer and position."""
# Run clean and corrupted forward passes
clean_logits = model(clean_tokens).logits
corrupted_logits = model(corrupted_tokens).logits
# Cache clean activations
clean_cache = {}
def patched_forward_hook(module, input, output):
patched_output = list(output)
if layer_idx in clean_cache:
patched_output[0] = clean_cache[layer_idx]
return tuple(patched_output)
# Collect clean activations and patch at target layer
def cache_hook(module, input, output):
clean_cache[layer_idx] = output[0].detach().clone()
return output
handle1 = model.transformer.h[layer_idx].register_forward_hook(cache_hook)
clean_result = model(clean_tokens)
handle1.remove()
# Patch corrupted run with clean activations
handle2 = model.transformer.h[layer_idx].register_forward_hook(patched_forward_hook)
patched_logits = model(corrupted_tokens).logits
handle2.remove()
return {
'clean': clean_logits[0, -1],
'corrupted': corrupted_logits[0, -1],
'patched': patched_logits[0, -1]
}
Interpreting Results
The difference between patched and corrupted logits reveals causal importance. A large difference indicates the patched location is part of the harmful computation circuit.
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Run a patching experiment on a small transformer model to identify which layers contribute most to refusing harmful instructions. Use the results to construct a minimal circuit diagram.