16. Quantization for Vision

Chapter 16 of 18 · 15 min

KEY INSIGHT

Vision model quantization reduces numeric precision (typically from float32 to int8), which lets larger models fit on limited hardware. The accuracy trade-off varies by visual domain: complex scenes suffer more than simple classification tasks.

Quantization works differently across model components. Weight quantization affects storage and memory bandwidth. Activation quantization requires careful calibration to avoid overflow while preserving semantic signal.

```python
import copy
import io
import time
from typing import Callable

import torch
from torch.ao.quantization import get_default_qconfig_mapping, quantize_dynamic
from torch.ao.quantization.quantize_fx import convert_fx, prepare_fx
from torchvision import models


class QuantizedVisionModel:
    def __init__(self, model_name: str = "efficientnet_b0"):
        # Look up the torchvision constructor by name; weights=None skips the download.
        self.base_model = getattr(models, model_name)(weights=None).eval()
        self.quantized_model = None

    def apply_dynamic_quantization(self):
        """
        Dynamic quantization: weights are stored as int8 and activations are
        quantized on the fly at inference time.
        Fast conversion, moderate memory savings.
        """
        # quantize_dynamic does not convert Conv2d, so only Linear is listed;
        # a convolution-heavy backbone gains little from this mode.
        self.quantized_model = quantize_dynamic(
            self.base_model, {torch.nn.Linear}, dtype=torch.qint8
        )
        return self.quantized_model

    def apply_static_quantization(self):
        """
        Static quantization: calibrate with representative data, then
        convert both weights and activations to int8.
        Best memory savings but requires calibration dataset.
        """
        # Post-training quantization calibrates in eval mode, not train mode.
        # FX graph mode is used here because eager mode needs
        # QuantStub/DeQuantStub placed in the model by hand.
        example_input = torch.randn(1, 3, 224, 224)
        qconfig_mapping = get_default_qconfig_mapping("fbgemm")  # x86 CPU backend
        prepared = prepare_fx(
            copy.deepcopy(self.base_model).eval(), qconfig_mapping, (example_input,)
        )

        # Calibrate with representative images
        # (random tensors are a placeholder for real data)
        calibration_loader = [torch.randn(1, 3, 224, 224) for _ in range(32)]
        with torch.no_grad():
            for cal_batch in calibration_loader:
                prepared(cal_batch)

        # Convert to quantized model
        self.quantized_model = convert_fx(prepared)
        return self.quantized_model

    def benchmark_inference(
        self,
        model: torch.nn.Module,
        input_tensor: torch.Tensor,
        iterations: int = 100,
    ) -> dict:
        """Measure latency and memory for model inference"""
        model.eval()
        use_cuda = input_tensor.is_cuda

        with torch.no_grad():
            # Warmup
            for _ in range(10):
                model(input_tensor)

            # Time iterations
            if use_cuda:
                torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(iterations):
                model(input_tensor)
            if use_cuda:
                torch.cuda.synchronize()
            end = time.perf_counter()

        avg_latency_ms = (end - start) / iterations * 1000

        # Memory usage
        if use_cuda:
            memory_mb = torch.cuda.max_memory_allocated() / (1024 ** 2)
        else:
            # sys.getsizeof would only measure the dict object, so serialize
            # the state_dict and use its size as a proxy for model memory.
            buffer = io.BytesIO()
            torch.save(model.state_dict(), buffer)
            memory_mb = buffer.getbuffer().nbytes / (1024 ** 2)

        return {"avg_latency_ms": avg_latency_ms, "memory_mb": memory_mb}


class VisionModelCompressor:
    """Compress vision models for edge deployment"""

    def __init__(self):
        self.pruning_threshold = 0.01

    def prune_filters(
        self,
        model: torch.nn.Module,
        importance_metric: Callable[[torch.Tensor], torch.Tensor],
    ) -> torch.nn.Module:
        """
        Zero out filters with low importance scores.
        Importance can be based on activation statistics or gradient magnitudes.
        importance_metric must return one score per output filter.
        """
        for name, module in model.named_modules():
            if isinstance(module, torch.nn.Conv2d):
                # Compute filter importance, shape (out_channels,)
                weights = module.weight.detach()
                importance = importance_metric(weights)

                # Create mask for important filters
                mask = (importance > self.pruning_threshold).to(weights.dtype)

                # Zero out unimportant filters; the mask is reshaped to
                # (out_channels, 1, 1, 1) so it applies to whole filters.
                with torch.no_grad():
                    module.weight.mul_(mask.view(-1, 1, 1, 1))
                    if module.bias is not None:
                        module.bias.mul_(mask)

        return model
```

**Failure Modes:**

- Static quantization accuracy collapses when the calibration set lacks diversity. Use a representative dataset that spans the input distribution.
- Quantization breaks models that rely on precise thresholds (for example, object detection with score cutoffs). Test threshold-dependent logic after quantization.
- Asymmetric quantization ranges can cause activation overflow. Monitor for NaN/Inf outputs.

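To make the calibration and overflow failure modes more concrete, here is a minimal pure-Python sketch of affine (asymmetric) int8 quantization. The helpers `calibrate`, `quantize`, and `dequantize` are hypothetical names for illustration, not PyTorch APIs, and the sketch assumes the scale and zero point come from the min/max of the calibration values. An activation outside the calibrated range saturates at the int8 limit instead of being represented.

```python
from typing import List, Tuple

QMIN, QMAX = -128, 127  # signed int8 range


def calibrate(values: List[float]) -> Tuple[float, int]:
    """Derive (scale, zero_point) from observed min/max (hypothetical helper)."""
    lo = min(min(values), 0.0)  # include 0 so that zero is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # fall back to 1.0 for all-zero data
    zero_point = int(round(QMIN - lo / scale))
    return scale, zero_point


def quantize(x: float, scale: float, zero_point: int) -> int:
    q = int(round(x / scale)) + zero_point
    return max(QMIN, min(QMAX, q))  # values outside the range saturate


def dequantize(q: int, scale: float, zero_point: int) -> float:
    return (q - zero_point) * scale


# Assumed scenario: calibration data only covers activations in [0, 2]
scale, zp = calibrate([0.0, 0.5, 1.0, 2.0])

# A value inside the calibrated range round-trips to within one step (scale)
in_range = dequantize(quantize(1.0, scale, zp), scale, zp)

# A value outside the range is clipped to the calibrated maximum
out_of_range = dequantize(quantize(6.0, scale, zp), scale, zp)
```

In this toy setup the input 6.0 comes back as roughly 2.0, which is the kind of silent error a narrow calibration set can introduce.
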
EXERCISE

Compare inference latency, memory usage, and accuracy across the float32, dynamically quantized, and statically quantized versions of a vision model on a test dataset. Identify where accuracy degrades.
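
One possible starting point for the accuracy part of the exercise is a per-class comparison of predictions from the float32 model and a quantized variant. This sketch assumes you have already collected integer class predictions from each model. `per_class_accuracy` and `accuracy_drop` are hypothetical helper names, and the labels below are toy values for illustration, not measured results.

```python
from collections import defaultdict
from typing import Dict, List


def per_class_accuracy(labels: List[int], preds: List[int]) -> Dict[int, float]:
    """Fraction of correct predictions for each ground-truth class."""
    correct: Dict[int, int] = defaultdict(int)
    total: Dict[int, int] = defaultdict(int)
    for y, p in zip(labels, preds):
        total[y] += 1
        correct[y] += int(y == p)
    return {c: correct[c] / total[c] for c in total}


def accuracy_drop(
    labels: List[int], float_preds: List[int], quant_preds: List[int]
) -> Dict[int, float]:
    """Per-class accuracy lost when moving from float32 to a quantized model."""
    base = per_class_accuracy(labels, float_preds)
    quant = per_class_accuracy(labels, quant_preds)
    return {c: base[c] - quant[c] for c in base}


# Toy data (assumed, for illustration only)
labels      = [0, 0, 1, 1, 2, 2]
float_preds = [0, 0, 1, 1, 2, 2]
quant_preds = [0, 0, 1, 0, 1, 1]

drops = accuracy_drop(labels, float_preds, quant_preds)
worst_class = max(drops, key=drops.get)  # class with the largest accuracy drop
```

Classes with the largest drop are a reasonable place to start looking for gaps in the calibration set.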