RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Model Compression

03. Pruning: Structured

Chapter 3 of 18 · 15 min
KEY INSIGHT

Structured pruning removes entire neurons, channels, or attention heads, producing dense models that benefit from standard hardware acceleration.

Structured pruning eliminates whole computational units rather than individual weights. Channel pruning removes entire filters from convolutional layers. Head pruning removes complete attention heads from transformer architectures. Neuron pruning removes entire hidden units from feedforward layers.

The result: model layers become smaller but remain dense. The dense format enables standard matrix multiplication without sparse overhead. Removing 50% of a convolutional layer's output channels halves that layer's FLOPs and weight memory. Since no sparse indices need tracking, the reduction carries over to inference speed through the hardware's standard acceleration pathways.

Channel pruning in convolutional networks illustrates the mechanism. A convolutional layer with 64 input channels and 64 output channels has a weight tensor of shape (64, 64, K, K), where K is the kernel size and the layout is (output channels, input channels, K, K), as in PyTorch. Pruning half the output channels reduces this to (32, 64, K, K). The remaining channels compute normally with dense matrix multiplication.

Implementations often combine structured and unstructured approaches in a hierarchy: coarse structured pruning at the neuron level, followed by fine unstructured pruning within surviving neurons. IMP (Iterative Magnitude Pruning), which alternates magnitude-based pruning with retraining, is one way to drive the fine unstructured stage. This hierarchy balances hardware efficiency with compression granularity.
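The two-stage hierarchy described above can be sketched with PyTorch's `torch.nn.utils.prune` utilities. This is a minimal illustration, not a reference recipe: the layer size, the 50% amounts, and the choice of L1 magnitude for the fine stage are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)
layer = nn.Linear(8, 4)  # illustrative size; weight shape is (out_features=4, in_features=8)

# Stage 1 (coarse, structured): mask the 50% of output neurons (rows)
# with the smallest L2 norm.
prune.ln_structured(layer, name="weight", amount=0.5, n=2, dim=0)

# Stage 2 (fine, unstructured): mask 50% of the weights that are still
# unpruned, chosen by smallest absolute value.
prune.l1_unstructured(layer, name="weight", amount=0.5)

mask = layer.weight_mask                       # combined mask from both stages
zero_rows = int((mask.sum(dim=1) == 0).sum())  # rows that are fully masked
sparsity = float((mask == 0).float().mean())   # overall fraction of masked weights
print(f"fully masked rows: {zero_rows}, overall sparsity: {sparsity:.2f}")
```

Note that these utilities apply masks: the weight tensor keeps its original shape, so the dense speedup only materializes once the masked rows are physically removed from the layer.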
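```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Minimal stand-in model so the calls below run (layer sizes are illustrative)
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv_layer = nn.Conv2d(3, 64, kernel_size=3)
        self.linear_layer = nn.Linear(128, 64)

model = Model()

# Note: torch.nn.utils.prune zeroes the selected units through a mask;
# the weight tensor keeps its original shape.

# Structured pruning at neuron level.
# nn.Linear weight is (out_features, in_features), so dim=0 selects whole rows.
prune.ln_structured(
    model.linear_layer,
    name="weight",
    amount=0.5,
    n=2,    # L2 norm
    dim=0,  # prune rows (output neurons)
)

# Structured pruning at filter level.
# nn.Conv2d weight is (out_channels, in_channels, K, K).
prune.ln_structured(
    model.conv_layer,
    name="weight",
    amount=0.5,
    n=2,    # L2 norm
    dim=0,  # prune whole filters (output channels)
)
```

A common failure occurs when structured pruning changes layer dimensions inconsistently. Removing output channels from one layer requires removing the matching input channels from the next layer. If one layer drops 30% of its output channels while the next layer still expects the original input width, a dimension mismatch arises. Maintaining alignment requires coordinating pruning decisions across layer boundaries or using adaptation layers that project between mismatched dimensions.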
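The alignment requirement above can be sketched by physically slicing two consecutive convolutions with the same set of kept channel indices. The helper name `prune_conv_pair` and its `keep_ratio` parameter are hypothetical, introduced only for this example. The sketch assumes plain `nn.Conv2d` layers with `groups=1` and default dilation, and no normalization layer between them.

```python
import torch
import torch.nn as nn

def prune_conv_pair(conv1: nn.Conv2d, conv2: nn.Conv2d, keep_ratio: float):
    """Hypothetical helper: drop conv1's weakest filters and conv2's matching input channels."""
    n_keep = max(1, int(round(conv1.out_channels * keep_ratio)))
    # L2 norm of each filter; weight layout is (out_channels, in_channels, K, K)
    norms = conv1.weight.detach().flatten(1).norm(p=2, dim=1)
    keep = torch.topk(norms, n_keep).indices.sort().values

    # Build smaller dense layers (assumes groups=1 and default dilation)
    new1 = nn.Conv2d(conv1.in_channels, n_keep, conv1.kernel_size,
                     stride=conv1.stride, padding=conv1.padding,
                     bias=conv1.bias is not None)
    new2 = nn.Conv2d(n_keep, conv2.out_channels, conv2.kernel_size,
                     stride=conv2.stride, padding=conv2.padding,
                     bias=conv2.bias is not None)
    with torch.no_grad():
        new1.weight.copy_(conv1.weight[keep])     # keep selected output filters
        if conv1.bias is not None:
            new1.bias.copy_(conv1.bias[keep])
        new2.weight.copy_(conv2.weight[:, keep])  # keep the matching input channels
        if conv2.bias is not None:
            new2.bias.copy_(conv2.bias)
    return new1, new2

torch.manual_seed(0)
conv1 = nn.Conv2d(64, 64, kernel_size=3, padding=1)
conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
p1, p2 = prune_conv_pair(conv1, conv2, keep_ratio=0.5)

x = torch.randn(2, 64, 16, 16)
y = p2(torch.relu(p1(x)))
print(tuple(p1.weight.shape))  # (32, 64, 3, 3)
print(tuple(p2.weight.shape))  # (128, 32, 3, 3)
print(tuple(y.shape))          # (2, 128, 16, 16)
```

Feeding the output of the pruned first layer into the unpruned second layer would fail with a channel-count error, which is the mismatch the paragraph describes. Using one shared index set for both layers is what keeps the pair consistent.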

EXERCISE

Implement structured channel pruning on a simple CNN using L2 norm across channels. Verify that layer output shapes remain valid by passing a sample batch through pruned layers.
