Feature Attribution — AI Safety and Alignment (Chapter 10)

Feature attribution methods explain which input tokens most influence a model's output. These techniques are foundational for understanding model behavior and diagnosing safety issues.

Gradient-Based Attribution

Saliency maps rank token importance using gradient magnitudes. The gradient of the output logit with respect to each input embedding indicates how sensitive the prediction is to that token.

import torch
import torch.nn.functional as F

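def compute_saliency(model, input_ids, target_token_id):
    """Compute gradient-based saliency for each input token (batch size 1)."""
    # Detach so the embeddings become a leaf tensor whose .grad is populated.
    embeddings = model.model.embed_tokens(input_ids).detach()
    embeddings.requires_grad_(True)

    # Feed the embeddings to the model so the logit actually depends on them.
    # Assumes a Hugging Face-style forward pass that accepts `inputs_embeds`.
    output = model(inputs_embeds=embeddings).logits[0, -1, target_token_id]
    output.backward()

    # Sum absolute gradients over the embedding dimension: one score per token.
    saliency_scores = embeddings.grad.abs().sum(dim=-1)
    return saliency_scores[0]  # Batch size 1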
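
To see what the saliency score measures without loading a model, the following is a minimal sketch that estimates the same quantity by finite differences. The linear `toy_logit`, its `WEIGHTS`, the example tokens, and the embeddings are all hypothetical stand-ins chosen for illustration; a real model would use autograd as in the function above.

```python
# Hypothetical weights for a toy linear "model": one weight vector per position.
WEIGHTS = [[0.5, -1.0], [2.0, 0.0], [0.1, 0.1]]

def toy_logit(embeddings):
    """Toy stand-in for an output logit: a weighted sum of embedding entries."""
    return sum(w * e for ws, es in zip(WEIGHTS, embeddings) for w, e in zip(ws, es))

def saliency_by_finite_difference(f, embeddings, eps=1e-5):
    """Per-token saliency: sum over embedding dimensions of |df/de|."""
    scores = []
    for t, vec in enumerate(embeddings):
        total = 0.0
        for d in range(len(vec)):
            plus = [list(v) for v in embeddings]
            minus = [list(v) for v in embeddings]
            plus[t][d] += eps
            minus[t][d] -= eps
            # Central difference approximates the partial derivative.
            total += abs((f(plus) - f(minus)) / (2 * eps))
        scores.append(total)
    return scores

tokens = ["the", "movie", "rocks"]
embeddings = [[1.0, 2.0], [0.5, -0.5], [3.0, 1.0]]
scores = saliency_by_finite_difference(toy_logit, embeddings)

# Rank tokens from most to least influential.
ranked = sorted(zip(tokens, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)
```

Because the toy model is linear, each score is just the L1 norm of that position's weight vector. In a nonlinear model the scores depend on the input as well.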

Integrated Gradients

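Plain gradients can vanish where the model saturates, so a token that matters may still receive a near-zero saliency score. Integrated Gradients addresses this by interpolating between a baseline and the actual input in embedding space (token IDs are discrete and cannot be interpolated), averaging the gradients along that straight-line path, and scaling the average by the difference between the input and the baseline.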
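
Written out, the path integral and the discrete approximation look as follows. The notation is introduced here for illustration and is not from the chapter: F is the scalar model output being explained, x the input embeddings, x' the baseline embeddings, and m the number of interpolation steps. The sum averages over m + 1 points, matching the `steps + 1` interpolation points used in the function that follows.

```latex
\mathrm{IG}_i(x)
  = (x_i - x'_i) \int_0^1
    \frac{\partial F\bigl(x' + \alpha\,(x - x')\bigr)}{\partial x_i}\, d\alpha
  \approx (x_i - x'_i)\, \frac{1}{m+1} \sum_{k=0}^{m}
    \frac{\partial F\bigl(x' + \tfrac{k}{m}\,(x - x')\bigr)}{\partial x_i}
```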

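def integrated_gradients(model, input_ids, baseline_ids, target_token_id, steps=50):
    """Approximate integrated gradients per input token for one target logit."""
    # Interpolate in embedding space: token IDs are discrete and cannot be scaled.
    with torch.no_grad():
        input_embeds = model.model.embed_tokens(input_ids)
        baseline_embeds = model.model.embed_tokens(baseline_ids)
    delta = input_embeds - baseline_embeds

    gradients = []
    for i in range(steps + 1):
        # Point on the straight-line path from baseline to input (a fresh leaf tensor).
        scaled = (baseline_embeds + delta * (i / steps)).requires_grad_(True)
        # Assumes a Hugging Face-style forward pass that accepts `inputs_embeds`.
        output = model(inputs_embeds=scaled).logits[0, -1, target_token_id]
        output.backward()
        gradients.append(scaled.grad.detach())

    # Averaging the gradients approximates the path integral (Riemann sum).
    avg_gradients = torch.stack(gradients).mean(dim=0)
    attribution = delta * avg_gradients
    return attribution.sum(dim=-1)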
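
The saturation problem can be reproduced on a toy function without any deep learning framework. The sketch below is a hypothetical example: `f`, its weights `W`, and the input values are assumptions chosen so that tanh is saturated at the input. It uses a midpoint Riemann sum, which is a choice made for this example rather than something the chapter prescribes.

```python
import math

W = [2.0, 1.0]  # hypothetical weights of a toy one-layer model

def f(x):
    """Toy model output: tanh of a weighted sum, which saturates for large inputs."""
    return math.tanh(sum(w * xi for w, xi in zip(W, x)))

def grad_f(x):
    """Analytic gradient of f with respect to each input feature."""
    z = sum(w * xi for w, xi in zip(W, x))
    slope = 1.0 - math.tanh(z) ** 2
    return [w * slope for w in W]

def integrated_gradients_toy(x, baseline, steps=200):
    """Integrated gradients with a midpoint Riemann sum along the straight path."""
    delta = [xi - bi for xi, bi in zip(x, baseline)]
    avg = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = (k - 0.5) / steps
        point = [bi + alpha * di for bi, di in zip(baseline, delta)]
        for i, g in enumerate(grad_f(point)):
            avg[i] += g / steps
    return [di * ai for di, ai in zip(delta, avg)]

x = [2.0, 2.0]
baseline = [0.0, 0.0]

plain_gradient = grad_f(x)                 # nearly zero: tanh is saturated at x
ig = integrated_gradients_toy(x, baseline)

print("plain gradient:", plain_gradient)
print("integrated gradients:", ig)
print("sum of attributions:", sum(ig), "vs f(x) - f(baseline):", f(x) - f(baseline))
```

The plain gradient at the input is close to zero for both features, while the integrated-gradients attributions sum to approximately f(x) - f(baseline). This is the completeness property that makes the method useful in saturated regions.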

SHAP Values

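SHAP (SHapley Additive exPlanations) attributes the output to tokens using Shapley values: each token's score is its marginal contribution, averaged over all subsets of the other tokens. The attributions are theoretically grounded, but exact computation scales exponentially with sequence length, so practical implementations rely on sampling or other approximations.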
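
For a handful of tokens the exact Shapley computation is small enough to write out in full. The sketch below is illustrative only: `toy_value` is a hypothetical value function standing in for "the model output when only these tokens are kept", and the three tokens are made up. It is not the API of the SHAP library.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating every subset of the other players."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi += weight * (value(s | {p}) - value(s))
        result[p] = phi
    return result

def toy_value(kept):
    """Hypothetical sentiment score when only the tokens in `kept` are present."""
    score = 0.0
    if "good" in kept:
        score += 2.0
    if "film" in kept:
        score += 0.5
    if "not" in kept and "good" in kept:
        score -= 3.0  # negation only matters when "good" is present
    return score

tokens = ["not", "good", "film"]
phi = shapley_values(tokens, toy_value)
print(phi)
```

The interaction between "not" and "good" is split evenly between the two tokens, and the attributions sum to the value of the full input minus the value of the empty input. The inner loops visit every subset of the other tokens, so the cost doubles with each token added. That cost is why exact computation is limited to very short sequences.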

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, sequence length, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.