18. Model Compression Pipeline Project

Chapter 18 of 18 · 30 min

KEY INSIGHT

Building an end-to-end compression pipeline requires integrating multiple techniques, handling edge cases, and validating results at each stage to produce production-ready compressed models.

This final chapter guides you through building a complete model compression pipeline that applies pruning, distillation, and quantization in a coordinated workflow to compress a real model.

### Project Overview

You will compress a ResNet-18 model for image classification, targeting:

- 75% reduction in model size
- Less than 2% accuracy drop from baseline (75.1% top-1 on an ImageNet subset)
- Inference latency under 5 ms on target hardware

### Starter Code

```python
import io

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
from torchvision import transforms
from torch.utils.data import DataLoader


class ModelCompressionPipeline:
    def __init__(self, model, config):
        self.model = model
        self.config = config
        self.history = []

    def run(self, train_loader, val_loader, test_loader):
        """
        Execute the full compression pipeline.
        """
        print("=" * 60)
        print("Starting Model Compression Pipeline")
        print("=" * 60)

        # Record the uncompressed baseline so the summary compares against it
        self._evaluate("baseline", test_loader)

        # Snapshot the unpruned model now; it is the distillation teacher in Stage 2
        teacher_model = self._create_teacher_copy()

        # Stage 1: Structured Pruning
        print("\n[Stage 1] Structured Pruning")
        self.model = self.apply_structured_pruning(
            self.model, train_loader, val_loader,
            sparsity=self.config.pruning_sparsity
        )
        self._evaluate("after_pruning", test_loader)

        # Stage 2: Knowledge Distillation
        print("\n[Stage 2] Knowledge Distillation")
        self.model = self.knowledge_distillation(
            self.model, teacher_model, train_loader, val_loader,
            temperature=self.config.distill_temperature
        )
        self._evaluate("after_distillation", test_loader)

        # Stage 3: Quantization
        print("\n[Stage 3] Quantization")
        self.model = self.quantize_model(
            self.model, train_loader, val_loader,
            target_bits=self.config.target_bits
        )
        self._evaluate("after_quantization", test_loader)

        print("\n" + "=" * 60)
        print("Pipeline Complete")
        self._print_summary()
        print("=" * 60)
        return self.model
```

The sections below add the stage methods to `ModelCompressionPipeline`. Helpers that are called but not listed in this chapter (`_create_teacher_copy`, `_validate`, `_compute_pruning_thresholds`, `_create_structured_masks`, `_apply_masks`, `_finetune_recovery`) need to be provided before the pipeline will run.

### Stage 1: Structured Pruning

```python
    def apply_structured_pruning(self, model, train_loader, val_loader, sparsity):
        """
        Apply structured channel pruning based on Taylor importance scores.
        """
        # Compute channel importance using Taylor method
        importance = self._compute_channel_importance(model, train_loader)

        # Determine pruning thresholds per layer
        thresholds = self._compute_pruning_thresholds(importance, sparsity)

        # Create pruning masks
        masks = self._create_structured_masks(model, importance, thresholds)

        # Apply pruning masks
        model = self._apply_masks(model, masks)

        # Recovery fine-tuning
        model = self._finetune_recovery(model, train_loader, val_loader, epochs=5)
        return model

    def _compute_channel_importance(self, model, train_loader):
        """
        Compute channel importance using first-order Taylor approximation.
        """
        model.eval()
        importance = {}

        # Hooks capture the weight gradient of each conv layer during backward
        gradients = {}

        def compute_grad(name):
            def hook(grad):
                gradients[name] = grad
            return hook

        hooks = []
        for name, module in model.named_modules():
            if isinstance(module, nn.Conv2d):
                handle = module.weight.register_hook(compute_grad(name))
                hooks.append(handle)

        # Collect importance metrics
        importance_sum = {}
        for batch in train_loader:
            inputs, targets = batch
            model.zero_grad()  # Avoid accumulating .grad across batches
            outputs = model(inputs)
            loss = F.cross_entropy(outputs, targets)
            loss.backward()

            for name, module in model.named_modules():
                if isinstance(module, nn.Conv2d):
                    grad = gradients.get(name)
                    if grad is not None:
                        # First-order Taylor term |w * dL/dw|, averaged over each
                        # output channel's (in_channels, kH, kW) weights
                        imp = (module.weight.detach() * grad).abs().mean(dim=(1, 2, 3))
                        if name not in importance_sum:
                            importance_sum[name] = imp
                        else:
                            importance_sum[name] += imp

        # Clean up hooks
        for handle in hooks:
            handle.remove()

        # Normalize importance so each layer's scores sum to 1
        for name in importance_sum:
            imp = importance_sum[name]
            importance[name] = imp / (imp.sum() + 1e-8)
        return importance
```

### Stage 2: Knowledge Distillation

```python
    def knowledge_distillation(self, student_model, teacher_model,
                               train_loader, val_loader, temperature=4.0):
        """
        Distill knowledge from teacher to student with soft targets.
        """
        teacher_model.eval()
        optimizer = torch.optim.Adam(student_model.parameters(), lr=1e-4)

        for epoch in range(15):
            # Re-enable training mode each epoch in case validation switched to eval
            student_model.train()
            epoch_loss = 0
            for batch in train_loader:
                inputs, targets = batch

                # Teacher predictions
                with torch.no_grad():
                    teacher_outputs = teacher_model(inputs)
                    soft_targets = F.softmax(teacher_outputs / temperature, dim=-1)

                # Student predictions
                student_outputs = student_model(inputs)

                # Distillation loss, scaled by T^2 to keep gradient magnitudes comparable
                distill_loss = F.kl_div(
                    F.log_softmax(student_outputs / temperature, dim=-1),
                    soft_targets,
                    reduction='batchmean'
                ) * (temperature ** 2)

                # Hard target loss
                hard_loss = F.cross_entropy(student_outputs, targets)

                # Combined loss
                loss = 0.7 * distill_loss + 0.3 * hard_loss

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                epoch_loss += loss.item()

            # Validate
            val_acc = self._validate(student_model, val_loader)
            print(f"  Epoch {epoch+1}: loss={epoch_loss/len(train_loader):.4f}, val_acc={val_acc:.4f}")

        student_model.eval()
        return student_model
```

### Stage 3: Quantization

```python
    def quantize_model(self, model, train_loader, val_loader, target_bits=8):
        """
        Apply quantization-aware training (QAT) for the specified bit width.
        """
        import torch.quantization as tq

        # The default fbgemm QAT configuration is 8-bit; other widths need a custom qconfig
        if target_bits != 8:
            raise NotImplementedError("Only 8-bit QAT is configured in this starter code")

        # Eager-mode QAT expects QuantStub/DeQuantStub around the network
        # (torchvision.models.quantization.resnet18 is one such variant)
        model.train()  # prepare_qat requires training mode
        model.qconfig = tq.get_default_qat_qconfig('fbgemm')
        tq.prepare_qat(model, inplace=True)

        # Fine-tune with quantization
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
        for epoch in range(10):
            model.train()
            for batch in train_loader:
                inputs, targets = batch
                optimizer.zero_grad()
                outputs = model(inputs)
                loss = F.cross_entropy(outputs, targets)
                loss.backward()
                optimizer.step()

            model.eval()
            acc = self._validate(model, val_loader)
            print(f"  QAT Epoch {epoch+1}: val_acc={acc:.4f}")

        # Convert to quantized model
        quantized_model = tq.convert(model.eval())
        return quantized_model
```

### Evaluation and Reporting

```python
    def _evaluate(self, stage_name, test_loader):
        """Evaluate model and record results."""
        model = self.model
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for batch in test_loader:
                inputs, targets = batch
                outputs = model(inputs)
                preds = outputs.argmax(dim=1)
                correct += (preds == targets).sum().item()
                total += targets.shape[0]

        accuracy = correct / total
        size_mb = self._compute_model_size() / 1e6
        self.history.append({
            'stage': stage_name,
            'accuracy': accuracy,
            'size_mb': size_mb
        })
        print(f"  Accuracy: {accuracy:.4f}, Size: {size_mb:.2f}MB")

    def _compute_model_size(self):
        """Calculate serialized model size in bytes."""
        # Serialize the state dict: converted quantized modules keep packed
        # weights that do not appear in model.parameters()
        buffer = io.BytesIO()
        torch.save(self.model.state_dict(), buffer)
        return buffer.getbuffer().nbytes

    def _print_summary(self):
        """Print final compression summary."""
        print("\nCompression Summary:")
        print("-" * 40)
        for record in self.history:
            print(f"  {record['stage']:<25} | Acc: {record['accuracy']:.4f} | Size: {record['size_mb']:.2f}MB")

        # history[0] is the "baseline" record written at the start of run()
        baseline_acc = self.history[0]['accuracy']
        final_acc = self.history[-1]['accuracy']
        size_reduction = self.history[0]['size_mb'] / self.history[-1]['size_mb']
        print("-" * 40)
        print(f"Accuracy drop: {(baseline_acc - final_acc)*100:.2f}%")
        print(f"Size reduction: {size_reduction:.2f}x")
```

### Running the Pipeline

```python
from types import SimpleNamespace


def main():
    # Load a pretrained model (older torchvision releases used pretrained=True)
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Configuration; attribute access matches self.config.<name> in the pipeline
    config = SimpleNamespace(
        pruning_sparsity=0.5,  # Remove 50% of channels
        distill_temperature=4.0,
        target_bits=8,
    )
    pipeline = ModelCompressionPipeline(model, config)

    # Load data (using small subset for demonstration);
    # train_dataset, val_dataset, and test_dataset must be defined beforehand
    train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=64)
    test_loader = DataLoader(test_dataset, batch_size=64)

    # Run pipeline
    compressed_model = pipeline.run(train_loader, val_loader, test_loader)

    # Export
    torch.save(compressed_model.state_dict(), 'compressed_resnet18.pt')
    return compressed_model
```
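
The project targets include an inference latency under 5 ms, but the pipeline above records only accuracy and size. The following is a minimal sketch of a timing helper, assuming latency means median wall-clock time per call on the target hardware. The name `measure_latency_ms` and its parameters are hypothetical and not part of the starter code.

```python
import statistics
import time


def measure_latency_ms(fn, warmup=10, iters=100):
    """Return the median wall-clock time of fn() in milliseconds.

    Hypothetical helper: fn is any zero-argument callable, for example a
    closure that runs one forward pass of the compressed model.
    """
    # Warm-up calls let caches and lazy initialisation settle before timing
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)

    # The median is less affected by occasional scheduling spikes than the mean
    return statistics.median(timings)


# Stand-in workload; replace with a forward pass of the compressed model
latency = measure_latency_ms(lambda: sum(range(10_000)), warmup=5, iters=50)
print(f"median latency: {latency:.3f} ms (target: < 5 ms)")
```

With PyTorch, the callable would typically wrap `model(example_input)` inside `torch.no_grad()`. On a GPU, timings are only meaningful if `torch.cuda.synchronize()` is called before the clock is read, because kernels launch asynchronously.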

Completion Summary

You have completed all 18 chapters of the Model Compression course. You now understand:

- How pruning removes redundant weights and structures
- How knowledge distillation transfers learned representations
- How quantization reduces numerical precision
- How to combine these techniques in effective pipelines
- How to evaluate and deploy compressed models in production

Next Steps:

- Apply these techniques to your own models
- Benchmark compression results on your target hardware
- Integrate monitoring to detect accuracy drift
- Iterate on your compression pipeline based on production feedback

For additional resources and support, visit the operator documentation portal.

EXERCISE

Modify the pipeline to achieve at least 80% size reduction with less than 3% accuracy drop by:

- Experimenting with different pruning sparsity levels (0.4, 0.5, 0.6, 0.7)
- Testing different distillation temperatures (2, 4, 6, 8)
- Trying 4-bit quantization instead of 8-bit
- Implementing layer-wise bit allocation based on layer sensitivity
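
One possible starting point for the last item is a greedy allocation: keep every layer at 8 bits, then lower the least sensitive layers to 4 bits until the parameter-weighted average bit width meets a budget. The sketch below assumes you have already measured a sensitivity score per layer (for example, the accuracy drop when only that layer is quantized to 4 bits). The function `allocate_bits`, the layer grouping, and the sensitivity values are illustrative assumptions, not results from the course.

```python
def allocate_bits(layers, avg_bit_budget, high=8, low=4):
    """Greedy layer-wise bit allocation (illustrative sketch).

    layers: dict mapping layer name -> (num_params, sensitivity),
            where a higher sensitivity means quantization hurts more.
    Returns a dict mapping layer name -> assigned bit width.
    """
    bits = {name: high for name in layers}
    total_params = sum(num_params for num_params, _ in layers.values())

    def average_bits():
        # Parameter-weighted mean bit width over all layers
        weighted = sum(layers[name][0] * width for name, width in bits.items())
        return weighted / total_params

    # Visit layers from least to most sensitive and lower them until the budget is met
    for name in sorted(layers, key=lambda key: layers[key][1]):
        if average_bits() <= avg_bit_budget:
            break
        bits[name] = low
    return bits


# Made-up sensitivities for a few layer groups; replace with your own measurements
example_layers = {
    "conv1": (9_408, 0.90),
    "layer1": (147_968, 0.20),
    "layer2": (525_568, 0.10),
    "fc": (513_000, 0.60),
}
allocation = allocate_bits(example_layers, avg_bit_budget=6.0)
print(allocation)
```

In this example the two least sensitive groups drop to 4 bits and the budget is then satisfied, so the more sensitive `conv1` and `fc` stay at 8 bits. A greedy pass like this ignores interactions between layers, so the resulting allocation still needs to be validated end to end.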

Plot the Pareto frontier of your experiments and identify the configuration that best balances size and accuracy for your deployment constraints.
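
A minimal sketch of extracting the Pareto frontier from experiment results follows, assuming each experiment is recorded as a (size in MB, accuracy) pair where smaller size and higher accuracy are both better. The function name `pareto_frontier` and the sample numbers are made up for illustration and are not measured results.

```python
def pareto_frontier(results):
    """Return the non-dominated (size_mb, accuracy) points, sorted by size."""
    frontier = []
    best_accuracy = float("-inf")
    # Sort by size ascending; for equal sizes, consider the higher accuracy first
    for size_mb, accuracy in sorted(results, key=lambda r: (r[0], -r[1])):
        # Keep a point only if it is more accurate than every smaller model seen so far
        if accuracy > best_accuracy:
            frontier.append((size_mb, accuracy))
            best_accuracy = accuracy
    return frontier


# Hypothetical experiment log: (size in MB, top-1 accuracy)
experiments = [
    (11.2, 0.742),
    (8.9, 0.735),
    (9.5, 0.731),  # dominated: larger and less accurate than (8.9, 0.735)
    (5.6, 0.712),
    (6.1, 0.704),  # dominated: larger and less accurate than (5.6, 0.712)
]
frontier = pareto_frontier(experiments)
print(frontier)
```

The returned points can be passed to any plotting library, for example as a scatter plot of all experiments with the frontier drawn as a connected line. Configurations off the frontier can be discarded, since another configuration is at least as small and more accurate.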
