10. Feature Attribution
Feature attribution methods explain which input tokens most influence a model's output. These techniques are foundational for understanding model behavior and diagnosing safety issues.
Gradient-Based Attribution
Saliency maps rank token importance using gradient magnitudes. The gradient of the output logit with respect to each input embedding indicates how sensitive the prediction is to that token.
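import torch

def compute_saliency(model, input_ids, target_token_id):
    """Compute gradient-based saliency for each input token."""
    # Detach so the embeddings become a leaf tensor that accumulates .grad.
    # model.model.embed_tokens assumes a Llama-style Hugging Face layout.
    embeddings = model.model.embed_tokens(input_ids).detach()
    embeddings.requires_grad_(True)
    # Feed the embeddings themselves; model(input_ids) would bypass this tensor.
    output = model(inputs_embeds=embeddings).logits[0, -1, target_token_id]
    output.backward()
    # Collapse the embedding dimension into one score per token.
    saliency_scores = embeddings.grad.abs().sum(dim=-1)
    return saliency_scores[0]  # Batch size 1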
Integrated Gradients
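Plain gradients can vanish where the model saturates, so a token that strongly affects the output may still receive near-zero saliency. Integrated Gradients addresses this by interpolating between a baseline (for example, a sequence of padding tokens) and the actual input, then accumulating gradients along that path.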
def integrated_gradients(model, input_ids, baseline_ids, target_token_id, steps=50):
    """Approximate integrated gradients for interpretability."""
    # Token IDs are discrete, so interpolate in embedding space instead.
    embed = model.model.embed_tokens  # assumes a Llama-style Hugging Face layout
    input_embeds = embed(input_ids).detach()
    baseline_embeds = embed(baseline_ids).detach()  # same shape as input_ids
    delta = input_embeds - baseline_embeds
    gradients = []
    for i in range(steps + 1):
        # Point on the straight-line path from baseline to input.
        scaled = (baseline_embeds + delta * (i / steps)).requires_grad_(True)
        output = model(inputs_embeds=scaled).logits[0, -1, target_token_id]
        output.backward()
        gradients.append(scaled.grad.detach())
    # Riemann-sum approximation of the path integral.
    avg_gradients = torch.stack(gradients).mean(dim=0)
    attribution = delta * avg_gradients
    return attribution.sum(dim=-1)
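The saturation problem and the path-integral idea can be seen without a language model. The following is a minimal sketch on a hypothetical two-feature function `f` (an assumption made for illustration, not part of any library): the plain gradient at the input is close to zero for the saturated feature, while integrated gradients still credits it. The attributions should also sum to roughly `f(x) - f(baseline)`, a property often called completeness.

```python
import math

def f(x):
    """Toy 'model': the first feature saturates, the second is linear."""
    return math.tanh(3.0 * x[0]) + 0.5 * x[1]

def grad_f(x):
    """Analytic gradient of f with respect to each feature."""
    return [3.0 * (1.0 - math.tanh(3.0 * x[0]) ** 2), 0.5]

def integrated_gradients_toy(x, baseline, steps=200):
    """Midpoint-rule approximation of integrated gradients."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_f(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Scale the averaged gradient by the input-baseline difference.
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]

x = [2.0, 1.0]
baseline = [0.0, 0.0]
plain_grad = grad_f(x)
ig = integrated_gradients_toy(x, baseline)

print("gradient at x:       ", plain_grad)
print("integrated gradients:", ig)
print("sum of attributions: ", sum(ig))
print("f(x) - f(baseline):  ", f(x) - f(baseline))
```

Comparing the first entry of `plain_grad` with the first entry of `ig` shows why a saliency map alone can understate a feature that has already pushed the model into a flat region.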
SHAP Values
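SHAP (SHapley Additive exPlanations) attributes a prediction to each token by averaging that token's marginal contribution over all subsets of the other tokens. The attributions are theoretically grounded in Shapley values from cooperative game theory, but exact computation scales exponentially with sequence length, so practical implementations rely on sampling or other approximations.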
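To make the marginal-contribution idea concrete, here is a minimal sketch that computes exact Shapley values by brute force. It is not the `shap` library's API; `shapley_values` and `toy_score` are hypothetical names, and the scoring function is an invented stand-in for a model evaluated on a subset of tokens. The sketch assumes the tokens are unique.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, value_fn):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        present = set()
        for p in order:
            before = value_fn(frozenset(present))
            present.add(p)
            after = value_fn(frozenset(present))
            totals[p] += after - before
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}

def toy_score(subset):
    """Hypothetical model score for a subset of tokens (illustration only)."""
    score = 0.0
    if "delete" in subset:
        score += 0.2
    # Interaction term: the two tokens together matter more than apart.
    if "delete" in subset and "files" in subset:
        score += 0.6
    return score

tokens = ["delete", "all", "files"]
phi = shapley_values(tokens, toy_score)
for token in tokens:
    print(f"{token:>8s}  {phi[token]:+.3f}")

# Efficiency property: attributions add up to the full-input score
# minus the empty-input score.
print("sum:", sum(phi.values()))
```

The loop visits every ordering of the tokens, which is already 6 orderings for 3 tokens and grows factorially from there. That cost is the reason exact computation is limited to very short inputs.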
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Implement a simplified attribution method that compares model outputs with and without each input token masked. Compare the results against gradient-based saliency for a harmful query classification task.
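One possible starting point for this exercise is sketched below. It runs against a hypothetical keyword-based scorer (`toy_harm_score`, invented here as a stand-in for a real classifier's harmful-class probability), and the `[MASK]` string is likewise an assumption; a real tokenizer may use a different mask or padding token.

```python
def occlusion_attribution(tokens, score_fn, mask_token="[MASK]"):
    """Attribute to each token the score drop observed when it is masked."""
    base = score_fn(tokens)
    attributions = []
    for i in range(len(tokens)):
        # Replace only position i, leaving the rest of the input intact.
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        attributions.append(base - score_fn(masked))
    return attributions

# Hypothetical per-token weights standing in for a trained classifier.
FLAGGED = {"exploit": 0.6, "password": 0.3}

def toy_harm_score(tokens):
    """Toy 'harmful' score in [0, 1] based on flagged keywords."""
    return min(1.0, sum(FLAGGED.get(t, 0.0) for t in tokens))

tokens = ["how", "to", "exploit", "a", "password", "reset"]
scores = occlusion_attribution(tokens, toy_harm_score)
for token, score in zip(tokens, scores):
    print(f"{token:>10s}  {score:+.2f}")
```

Swapping `toy_harm_score` for a function that wraps a real classifier leaves `occlusion_attribution` unchanged, and its per-token scores can then be compared position by position with the output of a gradient-based saliency function.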