03. Pruning: Structured

Chapter 3 of 18 · 15 min

KEY INSIGHT

Structured pruning removes entire neurons, channels, or attention heads, producing dense models that benefit from standard hardware acceleration.

Structured pruning eliminates whole computational units rather than individual weights. Channel pruning removes entire filters from convolutional layers. Head pruning removes complete attention heads from transformer architectures. Neuron pruning removes entire hidden units from feedforward layers.

The result: model layers become smaller but remain dense. The dense format allows standard matrix multiplication with no sparse-format overhead. Removing 50% of the output channels from a convolutional layer halves the FLOPs and weight memory for that layer, and the next layer also receives half as many input channels. Since no sparse indices need tracking, the reduction carries over to inference speed through the hardware's standard dense kernels.

Channel pruning in convolutional networks illustrates the mechanism. A convolutional layer with 64 input channels and 64 output channels has a weight tensor of shape (64, 64, K, K), where K is the kernel size. In PyTorch's (out_channels, in_channels, K, K) layout, pruning half the output channels reduces this to (32, 64, K, K). The remaining channels compute normally with dense matrix multiplication.

Implementations sometimes combine structured and unstructured approaches in a hierarchy: coarse structured pruning at the neuron level, followed by fine unstructured pruning, such as Iterative Magnitude Pruning (IMP), within surviving neurons. This hierarchy balances hardware efficiency with compression granularity.
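Returning to the 64-channel convolution example, the shape and FLOP figures can be checked with plain arithmetic. The sketch below counts multiply-accumulates for one dense convolution and uses PyTorch's (out_channels, in_channels, K, K) weight layout. The helper names `conv_weight_shape` and `conv_macs`, the 3x3 kernel, and the 32x32 output feature map are illustrative assumptions, not part of the chapter.

```python
def conv_weight_shape(in_channels, out_channels, k):
    """Weight shape in PyTorch's (out_channels, in_channels, K, K) layout."""
    return (out_channels, in_channels, k, k)


def conv_macs(in_channels, out_channels, k, out_h, out_w):
    """Multiply-accumulates for one dense conv layer (bias ignored)."""
    return out_channels * in_channels * k * k * out_h * out_w


# Assumed sizes for illustration: 3x3 kernel, 32x32 output feature map.
K, H, W = 3, 32, 32

dense_shape = conv_weight_shape(64, 64, K)
pruned_shape = conv_weight_shape(64, 32, K)  # half the output channels removed

dense_macs = conv_macs(64, 64, K, H, W)
pruned_macs = conv_macs(64, 32, K, H, W)

print(dense_shape, pruned_shape)   # weight shapes before and after pruning
print(pruned_macs / dense_macs)    # fraction of the original compute
```

The same function applied to the following layer, with `in_channels` reduced from 64 to 32, shows that its cost drops as well, which is why channel pruning compounds across consecutive layers.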
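```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


class TinyNet(nn.Module):  # placeholder model so the example runs
    def __init__(self):
        super().__init__()
        self.conv_layer = nn.Conv2d(3, 16, kernel_size=3)
        self.linear_layer = nn.Linear(16, 8)


model = TinyNet()

# Structured pruning at neuron level.
# nn.Linear stores weight as (out_features, in_features), so dim=0
# selects entire rows, one row per output neuron.
prune.ln_structured(
    model.linear_layer,
    name="weight",
    amount=0.5,
    n=2,    # L2 norm
    dim=0,  # prune rows (output neurons)
)

# Structured pruning at filter level.
# nn.Conv2d stores weight as (out_channels, in_channels, K, K), so dim=0
# selects entire filters, one per output channel.
prune.ln_structured(
    model.conv_layer,
    name="weight",
    amount=0.5,
    n=2,    # L2 norm
    dim=0,  # prune filters (output channels)
)
```

A common failure occurs when structured pruning changes layer dimensions inconsistently. If one layer loses 30% of its output channels but the next layer still expects the original number of input channels, or if two branches that are later summed are pruned by different amounts (say 30% and 60%), dimension mismatches arise. Maintaining alignment requires coordinating pruning decisions across layer boundaries or using adaptation layers that project between mismatched dimensions.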
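Coordinating a pruning decision across a layer boundary can be sketched with two linear layers. Note that `torch.nn.utils.prune` zeroes the selected slices through a mask and leaves tensor shapes unchanged, so getting genuinely smaller layers takes an explicit slicing step. The example below is a minimal sketch under assumed layer sizes (16, 8, 4) and hypothetical names (`fc1`, `fc2`, `keep`): it ranks hidden neurons by the L2 norm of their weight rows, keeps the top half, and removes the matching columns from the next layer so the two stay aligned.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

fc1 = nn.Linear(16, 8)  # weight shape (8, 16): one row per hidden neuron
fc2 = nn.Linear(8, 4)   # weight shape (4, 8): one column per hidden neuron

# Rank hidden neurons by the L2 norm of their fc1 weight rows; keep the top half.
norms = fc1.weight.detach().norm(p=2, dim=1)
keep = torch.topk(norms, k=4).indices.sort().values

# Build smaller dense layers and copy the surviving weights into them.
fc1_small = nn.Linear(16, 4)
fc2_small = nn.Linear(4, 4)
with torch.no_grad():
    fc1_small.weight.copy_(fc1.weight[keep])     # drop rows of fc1
    fc1_small.bias.copy_(fc1.bias[keep])
    fc2_small.weight.copy_(fc2.weight[:, keep])  # drop the matching columns of fc2
    fc2_small.bias.copy_(fc2.bias)

# A sample batch passes through without a dimension mismatch.
x = torch.randn(2, 16)
out = fc2_small(torch.relu(fc1_small(x)))
print(out.shape)
```

Using the same `keep` indices for the rows of one layer and the columns of the next is the whole coordination step; pruning the two layers independently would leave `fc2` expecting inputs that `fc1` no longer produces.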

EXERCISE

Implement structured channel pruning on a simple CNN using L2 norm across channels. Verify that layer output shapes remain valid by passing a sample batch through pruned layers.