Activation Patching — AI Safety and Alignment (Chapter 11)

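Activation patching (also called causal tracing or interchange intervention, and closely related to causal mediation analysis) measures the causal effect of specific model components on the output. Path patching is a finer-grained variant that restricts the intervention to particular paths between components. The procedure uses two inputs that differ in the property of interest, a clean prompt and a corrupted one, and copies activations from one run into the other. The experiment below patches clean activations into the corrupted run: if the output moves back toward the clean behavior, the patched component carries information that the behavior depends on. In a safety setting, this helps localize the circuits involved in producing a harmful output.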
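Before turning to a real model, the idea can be sketched on a toy function. The example below is purely illustrative: `toy_model`, its two "components" `a` and `b`, and the input values are hypothetical and stand in for a network with two parallel paths. Only component `a` reads the feature that differs between the clean and corrupted inputs, so patching `a` restores the clean output while patching `b` changes nothing.

```python
def toy_model(x, patch=None):
    """Sum of two parallel components; `patch` overrides named activations."""
    acts = {
        "a": 2.0 * x[0],   # component a reads only feature 0
        "b": -1.0 * x[1],  # component b reads only feature 1
    }
    if patch:
        acts.update(patch)  # the intervention: overwrite chosen activations
    return acts["a"] + acts["b"], acts


clean = (1.0, 3.0)
corrupted = (5.0, 3.0)  # differs from clean only in feature 0

clean_out, clean_acts = toy_model(clean)
corrupted_out, _ = toy_model(corrupted)

# Patch one clean activation at a time into the corrupted run
patched_a, _ = toy_model(corrupted, patch={"a": clean_acts["a"]})
patched_b, _ = toy_model(corrupted, patch={"b": clean_acts["b"]})

print("clean:", clean_out, "corrupted:", corrupted_out)
print("patch a:", patched_a, "patch b:", patched_b)
```

Patching `a` recovers the clean output and patching `b` leaves the corrupted output unchanged, which is the signature used to attribute the behavior to a component.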

Mechanics of Patching

import torch


def activation_patch_experiment(
    clean_tokens, corrupted_tokens, model, layer_idx, position_idx
):
    """Patch clean activations into a corrupted run at one layer and position.

    Assumes a GPT-2-style Hugging Face model whose blocks live in
    model.transformer.h, and that both inputs have the same sequence length
    so that position_idx refers to the same token slot in each run.
    """
    block = model.transformer.h[layer_idx]
    clean_cache = {}

    def cache_hook(module, inputs, output):
        # Blocks usually return a tuple with the hidden states first
        hidden = output[0] if isinstance(output, tuple) else output
        clean_cache[layer_idx] = hidden.detach().clone()

    def patch_hook(module, inputs, output):
        # Overwrite only the target position with the cached clean activation
        is_tuple = isinstance(output, tuple)
        hidden = (output[0] if is_tuple else output).clone()
        hidden[:, position_idx] = clean_cache[layer_idx][:, position_idx]
        return ((hidden,) + tuple(output[1:])) if is_tuple else hidden

    with torch.no_grad():
        # Clean run: record the logits and cache the block output in one pass
        handle = block.register_forward_hook(cache_hook)
        try:
            clean_logits = model(clean_tokens).logits
        finally:
            handle.remove()

        # Corrupted baseline with no intervention
        corrupted_logits = model(corrupted_tokens).logits

        # Corrupted run with the clean activation patched in
        handle = block.register_forward_hook(patch_hook)
        try:
            patched_logits = model(corrupted_tokens).logits
        finally:
            handle.remove()

    # Next-token logits at the final position of the first batch element
    return {
        'clean': clean_logits[0, -1],
        'corrupted': corrupted_logits[0, -1],
        'patched': patched_logits[0, -1]
    }

Interpreting Results

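Compare the three sets of logits on a metric tied to the behavior under study, such as the logit of the target token or the logit difference between the correct and incorrect answers. Because this experiment patches clean activations into the corrupted run, a patched result that moves away from the corrupted baseline and toward the clean one indicates that the patched location carries information the behavior depends on. A large effect is evidence that the location participates in the circuit under study, for example one that produces a harmful output. It does not show that the location is the only component involved, so sweep over layers and positions before drawing conclusions.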
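One common way to summarize the comparison is a normalized effect: the fraction of the clean-versus-corrupted gap that the patch recovers. The sketch below is an assumption about how such a metric could be computed, not a method prescribed by this chapter. The helper names `logit_diff` and `normalized_patching_effect`, the token ids, and the example logits are all hypothetical. It accepts the dictionary returned by `activation_patch_experiment`, or plain lists as shown here.

```python
def logit_diff(logits, correct_id, incorrect_id):
    """Logit of the correct token minus logit of the incorrect token."""
    return float(logits[correct_id]) - float(logits[incorrect_id])


def normalized_patching_effect(result, correct_id, incorrect_id):
    """Fraction of the clean-vs-corrupted gap recovered by the patch.

    `result` maps 'clean', 'corrupted', and 'patched' to 1-D logit vectors
    (lists, or tensors that support integer indexing).
    """
    clean = logit_diff(result["clean"], correct_id, incorrect_id)
    corrupted = logit_diff(result["corrupted"], correct_id, incorrect_id)
    patched = logit_diff(result["patched"], correct_id, incorrect_id)
    gap = clean - corrupted
    if gap == 0.0:
        raise ValueError("clean and corrupted runs give the same logit difference")
    return (patched - corrupted) / gap


# Made-up logits over a three-token vocabulary, for illustration only
example = {
    "clean": [0.0, 4.0, 1.0],
    "corrupted": [0.0, 1.0, 3.0],
    "patched": [0.0, 3.0, 2.0],
}
effect = normalized_patching_effect(example, correct_id=1, incorrect_id=2)
print(f"normalized effect: {effect:.2f}")
```

A value near 0 means the patch left the corrupted behavior unchanged, and a value near 1 means it restored the clean behavior. Values outside that range can occur and usually deserve a closer look.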

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package versions, runtime, data path, and observed output. If the result depends on model size, sequence length, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
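The bookkeeping for this checkpoint can be partly automated. The sketch below uses only the Python standard library. The function name `environment_record` is hypothetical, and the default package names `torch` and `transformers` are assumptions about what the chapter's code depends on, so adjust them to match the local setup.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(packages=("torch", "transformers")):
    """Collect interpreter, platform, and package versions for a run log."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed here
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


record = environment_record()
print(json.dumps(record, indent=2))
```

Saving this output next to the observed logits makes it easier to explain later differences that stem from a changed library version or backend.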