RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
AI Safety and Alignment

10. Feature Attribution

Chapter 10 of 18 · 15 min
KEY INSIGHT

Feature attribution turns opaque neural computations into human-interpretable rankings of input importance. No single method dominates: gradient methods are fast but noisy, while SHAP values are principled but computationally expensive.

Feature attribution methods explain which input tokens most influence a model's output. These techniques are foundational for understanding model behavior and diagnosing safety issues.

Gradient-Based Attribution

Saliency maps rank token importance using gradient magnitudes. The gradient of the output logit with respect to each input embedding indicates how sensitive the prediction is to that token.

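import torch

def compute_saliency(model, input_ids, target_token_id):
    """Compute gradient-based saliency for each input token."""
    # Assumes the embedding layer lives at model.model.embed_tokens and the
    # model accepts inputs_embeds (true for many Hugging Face causal LMs).
    # Detach so the embeddings become a leaf tensor whose .grad is populated.
    embeddings = model.model.embed_tokens(input_ids).detach()
    embeddings.requires_grad_(True)

    # Feed the embeddings directly so the logit depends on them in the graph.
    output = model(inputs_embeds=embeddings).logits[0, -1, target_token_id]
    output.backward()

    # Collapse the hidden dimension: one non-negative score per token.
    saliency_scores = embeddings.grad.abs().sum(dim=-1)
    return saliency_scores[0]  # Batch size 1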
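
One way to read the scores returned by compute_saliency is to normalize them and list the highest-ranked tokens. The helper below is a minimal sketch of that step. The name top_salient_tokens, the example tokens, and the example scores are hypothetical and are not model output.

```python
def top_salient_tokens(saliency_scores, tokens, k=3):
    """Return the k (token, share) pairs with the highest saliency.

    saliency_scores can be any 1-D sequence of numbers, including a 1-D
    torch tensor; each element is converted with float().
    """
    scores = [float(s) for s in saliency_scores]
    # Normalize to shares of the total so prompts of different lengths
    # can be compared; fall back to 1.0 if every score is zero.
    total = sum(scores) or 1.0
    ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
    return [(token, score / total) for token, score in ranked[:k]]


# Hypothetical scores for a five-token prompt (illustration only).
example_tokens = ["How", "bypass", "a", "filter", "?"]
example_scores = [0.2, 1.6, 0.1, 0.9, 0.2]
ranked = top_salient_tokens(example_scores, example_tokens, k=2)
```

Normalizing to shares is a design choice for readability; raw gradient magnitudes are not comparable across models or prompts.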

Integrated Gradients

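Plain gradients can vanish where activation functions saturate, so a token that matters may still receive a near-zero saliency score. Integrated Gradients addresses this by interpolating along a straight path from a baseline input to the actual input and averaging the gradients along that path. The attribution is the difference between input and baseline multiplied by that average gradient. For language models the interpolation is done in embedding space, because token IDs are discrete.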

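def integrated_gradients(model, input_ids, baseline_ids, target_token_id, steps=50):
    """Approximate integrated gradients over the input embeddings."""
    # Token IDs are discrete, so interpolate between embeddings, not IDs.
    # target_token_id selects the logit to explain, as in compute_saliency.
    embed = model.model.embed_tokens
    with torch.no_grad():
        input_embeds = embed(input_ids)
        baseline_embeds = embed(baseline_ids)
    delta = input_embeds - baseline_embeds

    gradients = []
    for i in range(steps + 1):
        # Point on the straight-line path from baseline to input.
        scaled = (baseline_embeds + delta * (i / steps)).requires_grad_(True)
        output = model(inputs_embeds=scaled).logits[0, -1, target_token_id]
        output.backward()
        gradients.append(scaled.grad.detach())

    # Averaging the gradients approximates the path integral.
    avg_gradients = torch.stack(gradients).mean(dim=0)
    attribution = delta * avg_gradients
    return attribution.sum(dim=-1)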
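
A useful sanity check for any Integrated Gradients implementation is completeness: the attributions should sum to roughly f(input) minus f(baseline). The sketch below illustrates this on a toy saturating function, tanh(w·x), using an analytic gradient instead of a language model. The function names, weights, and inputs are all assumptions chosen for illustration.

```python
import math


def f(x, w):
    """Toy saturating model: tanh of a weighted sum."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)))


def grad_f(x, w):
    """Analytic gradient: d/dx_i tanh(w.x) = (1 - tanh(w.x)^2) * w_i."""
    slope = 1.0 - f(x, w) ** 2
    return [slope * wi for wi in w]


def integrated_gradients_toy(x, baseline, w, steps=200):
    """Average the gradient at steps + 1 points on the straight path from
    baseline to x, then scale by (x - baseline)."""
    avg_grad = [0.0] * len(x)
    for i in range(steps + 1):
        alpha = i / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for j, g in enumerate(grad_f(point, w)):
            avg_grad[j] += g / (steps + 1)
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


w = [1.0, 1.0]
x = [3.0, 2.0]          # w.x = 5, deep in the saturated region of tanh
baseline = [0.0, 0.0]

plain_gradient = grad_f(x, w)                    # nearly zero at x
attributions = integrated_gradients_toy(x, baseline, w)
completeness_gap = abs(sum(attributions) - (f(x, w) - f(baseline, w)))
```

At the saturated input the plain gradient is close to zero for both features, while the integrated attributions stay clearly non-zero and sum to approximately the change in output. The remaining gap comes from the finite number of steps and shrinks as steps grows.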

SHAP Values

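SHAP (SHapley Additive exPlanations) assigns each token its Shapley value: the marginal contribution of that token to the output, averaged over all subsets of the other tokens. The attributions are theoretically grounded, but exact computation needs a number of model evaluations that grows exponentially with sequence length, so practical implementations approximate it by sampling.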
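
To make the cost concrete, the sketch below computes exact Shapley values by enumerating every subset of tokens. It does not use the shap library. The value function toy_value is a hypothetical stand-in for "model output when only these token positions are unmasked", and the three-token setup is an assumption for illustration.

```python
from itertools import combinations
from math import factorial


def exact_shapley(n_players, value_fn):
    """Exact Shapley values by enumerating all subsets of the other players.

    Each player i has 2^(n_players - 1) subsets to visit, which is why
    this is only feasible for very short sequences.
    """
    players = range(n_players)
    phi = [0.0] * n_players
    for i in players:
        others = [p for p in players if p != i]
        for size in range(len(others) + 1):
            # Shapley weight for subsets of this size: |S|! (n-|S|-1)! / n!
            weight = (factorial(size) * factorial(n_players - size - 1)
                      / factorial(n_players))
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value_fn(s | {i}) - value_fn(s))
    return phi


def toy_value(present):
    """Hypothetical score: token 0 adds 0.2 on its own; tokens 1 and 2
    add 1.0 only when both are present (an interaction effect)."""
    score = 0.2 if 0 in present else 0.0
    if 1 in present and 2 in present:
        score += 1.0
    return score


shapley_values = exact_shapley(3, toy_value)
```

In this toy case the interaction credit of 1.0 is split evenly between tokens 1 and 2, and the values sum to the difference between the full-input score and the empty-input score.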

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Implement a simplified attribution method that compares model outputs with and without each input token masked. Compare the results against gradient-based saliency for a harmful query classification task.
