05. Movement Pruning

Chapter 5 of 18 · 15 min

KEY INSIGHT

Movement pruning removes weights that remain small throughout training, identifying parameters whose contribution stays low across optimization rather than at a single static checkpoint.

Standard magnitude pruning evaluates weights at a single checkpoint, typically after training completes. Movement pruning instead tracks weight magnitudes across training and identifies weights that begin small and stay small. Under this criterion, such weights are treated as never contributing meaningfully to the network's learned function.

The movement score measures how consistently a weight stays small. In the implementation below, it is an exponential moving average (EMA) of the weight's absolute value with decay beta. Weights that spike during training and return to small values demonstrate dynamic contribution. Weights that remain small throughout indicate persistent dormancy. Movement pruning removes the latter, preserving weights with time-varying importance. How long a past spike keeps a weight's score high depends on beta: with beta = 0.9, the average effectively spans about the last 10 updates (1 / (1 - beta)), so a larger beta is needed to remember earlier activity.

```python
import torch


class MovementPruner:
    """
    Tracks weight magnitudes across training to identify
    consistently small weights.
    """

    def __init__(self, model, beta=0.9):
        self.model = model
        self.movement_scores = {}
        self.beta = beta  # Exponential moving average decay
        self._init_scores()

    def _weighted_modules(self):
        """Yield (name, module) pairs whose `weight` attribute is a tensor."""
        for name, module in self.model.named_modules():
            if isinstance(getattr(module, 'weight', None), torch.Tensor):
                yield name, module

    def _init_scores(self):
        """Allocate one zero-initialized score tensor per weight tensor."""
        for name, module in self._weighted_modules():
            self.movement_scores[name] = torch.zeros_like(module.weight)

    @torch.no_grad()  # keep the running scores out of the autograd graph
    def update_scores(self):
        """Update running scores with current magnitudes (call after each optimizer step)."""
        for name, module in self._weighted_modules():
            magnitude = module.weight.abs()
            self.movement_scores[name] = (
                self.beta * self.movement_scores[name]
                + (1 - self.beta) * magnitude
            )

    @torch.no_grad()
    def prune(self, sparsity):
        """Zero the `sparsity` fraction of each weight tensor with the lowest scores."""
        for name, module in self._weighted_modules():
            scores = self.movement_scores[name]
            # Per-tensor threshold: the `sparsity` quantile of the scores.
            threshold = torch.quantile(scores.flatten().float(), sparsity)
            mask = scores > threshold
            # The mask is not stored, so further training can regrow pruned weights.
            module.weight.mul_(mask.to(module.weight.dtype))
```

Movement pruning offers several advantages over magnitude pruning. First, it identifies weights with consistently low contribution rather than those that happen to be small at evaluation time. Second, it tolerates transient increases in weight magnitude during training that later revert. Third, the movement pattern itself provides information about weight importance.

The computational overhead of movement tracking remains modest. After each training step, the pruner updates exponential moving averages of weight magnitudes: one elementwise pass over the weights, plus one score tensor of the same shape as each tracked weight tensor. No forward passes beyond those already required for training are needed. The scoring happens during the normal training loop.

A failure mode appears when training hyperparameters interact poorly with movement scores. High learning rates cause weights to fluctuate more, reducing the signal-to-noise ratio in movement scores. Very low learning rates cause weights to move less, so scores stay close to the initial magnitudes and important weights that happened to start small can be misclassified as unimportant. Movement pruning works best with stable training dynamics.
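
As a rough illustration of how score tracking fits into an ordinary training loop, the sketch below updates EMA scores once per optimizer step. The helper name `update_ema_scores`, the toy two-layer model, the synthetic data, and the choice to track only parameters with more than one dimension (skipping biases) are all assumptions made for this example, not part of the chapter's pruner.

```python
import torch
import torch.nn as nn


def update_ema_scores(model, scores, beta=0.9):
    """Blend each weight matrix's current |w| into its running EMA score."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if param.dim() < 2:  # assumption: skip biases and other 1-D params
                continue
            if name not in scores:
                scores[name] = torch.zeros_like(param)
            # scores <- beta * scores + (1 - beta) * |w|
            scores[name].mul_(beta).add_(param.abs(), alpha=1 - beta)
    return scores


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.MSELoss()
x, y = torch.randn(64, 8), torch.randn(64, 1)  # synthetic data, illustration only

scores = {}
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    update_ema_scores(model, scores)  # one elementwise pass, no extra forward pass
```

The only additions to the usual loop are the `scores` dictionary and the single call after `optimizer.step()`, which is why the overhead stays small relative to the forward and backward passes.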

EXERCISE

Implement movement pruning tracking for a transformer model during training. After convergence, compare which weights get pruned under movement versus magnitude criteria. Identify where the criteria disagree.
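
One possible starting point for the comparison step is sketched below. It assumes you already have an EMA score tensor and the final weight tensor for a layer; the helper names `keep_mask` and `disagreement` are hypothetical, and the four-element tensors are made-up values chosen only to show the mechanics.

```python
import torch


def keep_mask(scores, sparsity):
    """Boolean mask keeping entries whose score exceeds the k-th smallest score."""
    k = int(sparsity * scores.numel())  # number of entries to prune
    if k == 0:
        return torch.ones_like(scores, dtype=torch.bool)
    threshold = torch.kthvalue(scores.flatten(), k).values
    # Entries tied with the threshold are pruned too, so ties can prune more than k.
    return scores > threshold


def disagreement(ema_scores, final_weight, sparsity):
    """Fraction of entries where the two criteria make different keep/prune decisions."""
    movement_keep = keep_mask(ema_scores, sparsity)
    magnitude_keep = keep_mask(final_weight.abs(), sparsity)
    return (movement_keep != magnitude_keep).float().mean().item()


# Toy values: the third weight was large on average but ended small,
# while the fourth ended larger than its running average.
ema_scores = torch.tensor([0.1, 0.5, 0.9, 0.2])
final_weight = torch.tensor([0.05, -0.6, 0.15, 0.3])
print(disagreement(ema_scores, final_weight, sparsity=0.5))  # 0.5
```

Applying `disagreement` per layer of the transformer gives a quick view of where the two criteria diverge most; inspecting the positions where the masks differ then shows which weights each criterion would have kept.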