RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
AI Safety and Alignment

11. Activation Patching

Chapter 11 of 18 · 15 min
KEY INSIGHT

Activation patching separates correlation from causation. By surgically replacing activations, you discover which circuits are genuinely responsible for an output versus which merely correlate with it.

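Activation patching (also called causal tracing or interchange intervention, and closely related to causal mediation analysis) measures the causal effect of specific model components on the output. Path patching is a finer-grained variant that intervenes on individual connections between components. By copying activations from a clean run into a corrupted run, one location at a time, you isolate which components carry the information that drives the behavior under study, such as a refusal or a harmful completion.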

Mechanics of Patching

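import torch


def activation_patch_experiment(
    clean_tokens, corrupted_tokens, model, layer_idx, position_idx
):
    """Patch one clean activation (layer, position) into the corrupted run.

    Assumes a Hugging Face GPT-2 style model (blocks at model.transformer.h)
    and that both prompts tokenize to the same length.
    """
    if clean_tokens.shape != corrupted_tokens.shape:
        raise ValueError("clean and corrupted prompts must have the same shape")

    block = model.transformer.h[layer_idx]
    clean_cache = {}

    def hidden_of(output):
        # Blocks usually return a tuple whose first item is the hidden states.
        return output[0] if isinstance(output, tuple) else output

    def cache_hook(module, inputs, output):
        # Store the clean activation; returning None leaves the pass unchanged.
        clean_cache[layer_idx] = hidden_of(output).detach().clone()

    def patch_hook(module, inputs, output):
        # Overwrite only the target position. Patching every position at once
        # would reproduce the clean run downstream and localize nothing.
        hidden = hidden_of(output).clone()
        hidden[:, position_idx] = clean_cache[layer_idx][:, position_idx]
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden

    with torch.no_grad():
        # Clean run: record logits and cache the target activation in one pass.
        handle = block.register_forward_hook(cache_hook)
        try:
            clean_logits = model(clean_tokens).logits
        finally:
            handle.remove()

        # Corrupted baseline with no intervention.
        corrupted_logits = model(corrupted_tokens).logits

        # Corrupted run with the clean activation patched in.
        handle = block.register_forward_hook(patch_hook)
        try:
            patched_logits = model(corrupted_tokens).logits
        finally:
            handle.remove()

    # Next-token logits at the final position of the first batch item.
    return {
        'clean': clean_logits[0, -1],
        'corrupted': corrupted_logits[0, -1],
        'patched': patched_logits[0, -1],
    }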

Interpreting Results

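The difference between the patched and corrupted logits measures causal importance. If patching a clean activation into the corrupted run moves the output most of the way back toward the clean output, the patched location carries information the behavior depends on. If the output barely moves, that location is at most correlated with the behavior. Compare a task-relevant quantity, such as the logit of the expected token, rather than the raw logit vector, and treat one large effect as evidence about that prompt pair, not as proof of a complete circuit.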
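
One common way to reduce the three logit vectors to a single number is a normalized logit difference. The sketch below is an assumption about how a patch could be scored, not something this chapter prescribes: `target_id` and `alt_id` are hypothetical token ids for the expected answer and a competing answer, and the function names are illustrative. A score near 0 means the patch changed nothing relative to the corrupted run, and a score near 1 means it restored the clean behavior.

```python
def logit_diff(logits, target_id, alt_id):
    """Logit of the target token minus logit of the alternative token."""
    return float(logits[target_id]) - float(logits[alt_id])


def patching_score(result, target_id, alt_id, eps=1e-8):
    """Fraction of the clean-vs-corrupted gap recovered by the patch.

    `result` is a dict with 'clean', 'corrupted', and 'patched' entries,
    each an indexable sequence of next-token logits.
    """
    clean = logit_diff(result['clean'], target_id, alt_id)
    corrupted = logit_diff(result['corrupted'], target_id, alt_id)
    patched = logit_diff(result['patched'], target_id, alt_id)
    gap = clean - corrupted
    if abs(gap) < eps:
        # The two prompts do not differ on this metric: score is undefined.
        return float('nan')
    return (patched - corrupted) / gap


# Toy logits over a 3-token vocabulary (made-up numbers, not model output).
example = {
    'clean': [2.0, 0.0, 0.5],
    'corrupted': [0.0, 2.0, 0.5],
    'patched': [1.0, 1.0, 0.5],
}
print(patching_score(example, target_id=0, alt_id=1))  # 0.5
```

Because the function only indexes into each entry and converts to `float`, it should also accept the tensors returned by `activation_patch_experiment`, though that pairing is untested here.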

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
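
The checkpoint can be captured with a small helper that writes the requested details into one record. This is a minimal sketch using only the Python standard library. The package list, the data path, and the field names are placeholders chosen for illustration, not values from this chapter.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_record(packages, data_path, observed_output, notes=""):
    """Collect the details the checkpoint asks for into one dictionary."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "data_path": data_path,
        "observed_output": observed_output,
        "notes": notes,
    }


record = build_run_record(
    packages=["torch", "transformers"],       # assumed dependencies
    data_path="prompts/example_pairs.jsonl",  # hypothetical path
    observed_output=None,                     # fill in after running
    notes="backend, model size, memory limits go here",
)
print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the exercise keeps the constraints (backend, model size, memory) attached to the result they explain.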

EXERCISE

Run a patching experiment on a small transformer model to identify which layers contribute most to refusing harmful instructions. Use the results to construct a minimal circuit diagram.
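
One way to organize the exercise is to sweep every layer and position, then rank layers by their strongest effect. The scaffold below is a sketch under stated assumptions: `score_fn(layer, position)` is a hypothetical callable that runs one patching experiment and returns a single number, and `fake_score` is a stand-in so the code runs without a model. With a real model, `score_fn` would wrap the chapter's `activation_patch_experiment` and reduce its result to a scalar.

```python
def sweep_patches(score_fn, n_layers, positions):
    """Score every (layer, position) pair; returns grid[layer][i]."""
    return [
        [score_fn(layer, pos) for pos in positions]
        for layer in range(n_layers)
    ]


def rank_layers(grid, top_k=3):
    """Rank layers by their largest absolute score across positions."""
    peaks = [(max(abs(s) for s in row), layer) for layer, row in enumerate(grid)]
    peaks.sort(reverse=True)
    return [layer for _, layer in peaks[:top_k]]


def fake_score(layer, position):
    # Stand-in scorer: pretends only layer 2, position 1 matters.
    return 1.0 if (layer, position) == (2, 1) else 0.0


grid = sweep_patches(fake_score, n_layers=4, positions=[0, 1, 2])
print(rank_layers(grid, top_k=1))  # [2]
```

The highest-ranked layers are candidates for the circuit diagram. Each one still needs a check on more than one prompt pair before it is drawn as part of the circuit.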
