17. Quantization for Video
Quantization reduces model memory footprint and inference latency by representing weights and activations with lower precision data types. For video processing, quantization often provides the latency reduction needed to meet real-time requirements.
Post-training quantization requires a calibration dataset to determine scaling factors for activation ranges. Without careful calibration, quantization introduces accuracy degradation that varies across input distributions. Video data with high motion variation needs calibration samples spanning the full input range.
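To make the role of calibration concrete, the following is a minimal sketch of how a scale and zero point can be derived from observed activation ranges, assuming unsigned 8-bit affine quantization. The function names and the sample values are hypothetical and chosen for illustration; real frameworks use observers with more elaborate range estimators.

```python
def observe_range(calibration_batches):
    """Return the global (min, max) seen across all calibration batches."""
    lo = min(min(batch) for batch in calibration_batches)
    hi = max(max(batch) for batch in calibration_batches)
    return lo, hi


def affine_qparams(lo, hi, qmin=0, qmax=255):
    """Map the observed float range [lo, hi] onto the integer grid [qmin, qmax]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain zero exactly
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for constant input
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))


def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Round to the nearest grid point and clamp (saturate) to the integer range."""
    return max(qmin, min(qmax, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    """Map an integer code back to its approximate float value."""
    return (q - zero_point) * scale


# Hypothetical activation values: calm scenes only vs. calm plus high-motion scenes.
low_motion = [[-1.0, 0.2, 3.0], [-0.5, 1.5, 2.0]]
with_high_motion = low_motion + [[-1.0, 4.0, 7.0]]

for name, batches in [("low motion only", low_motion),
                      ("with high motion", with_high_motion)]:
    scale, zp = affine_qparams(*observe_range(batches))
    restored = dequantize(quantize(6.0, scale, zp), scale, zp)
    print(f"{name}: scale={scale:.4f} zero_point={zp} 6.0 -> {restored:.3f}")
```

When the calibration set contains only low-motion samples, an activation of 6.0 saturates at the top of the integer range and is restored as roughly 3.0. Including the high-motion batch widens the range so the same value survives the round trip, at the cost of a coarser step size for all values.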
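import torch
import torch.quantization as tq

# Dynamic quantization: weights are stored as INT8 and activations are
# quantized on the fly (suited to LSTM/transformer models)
model_quantized = tq.quantize_dynamic(
    model, {torch.nn.Linear, torch.nn.LSTM},
    dtype=torch.qint8
)

# Static quantization (weights and activations, requires calibration).
# Eager mode expects QuantStub/DeQuantStub around the quantized region.
model.eval()
model.qconfig = tq.get_default_qconfig('fbgemm')  # x86 backend
tq.prepare(model, inplace=True)  # insert observers
# Calibrate with a representative dataset. calibrate() is a user-supplied
# helper that runs forward passes so the observers record activation ranges.
calibrate(model, calibration_data)
tq.convert(model, inplace=True)  # swap in quantized modules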
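INT8 quantization typically provides a 2-4x speedup over FP32 on hardware with native INT8 support, along with roughly a 4x reduction in weight memory, since each value takes one byte instead of four. However, video preprocessing operations (resize, normalize, color space conversion) often remain in FP32, so frames are converted between types before they reach the quantized model, which adds overhead. Keeping the pipeline in 8-bit integers throughout requires careful operator implementation.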
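The memory figure follows directly from the element size. The sketch below works through the arithmetic for a model and a single input clip; the parameter count and clip shape are hypothetical values chosen only for illustration, and the estimate ignores the small amount of scale and zero-point metadata that a real quantized model also stores.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}


def tensor_megabytes(num_elements, dtype):
    """Storage for a dense tensor, ignoring scale/zero-point metadata."""
    return num_elements * BYTES_PER_ELEMENT[dtype] / 1e6


# Hypothetical model size and clip shape, chosen only for illustration.
num_params = 25_000_000
frames, height, width, channels = 16, 224, 224, 3
clip_elements = frames * height * width * channels

for dtype in ("fp32", "int8"):
    print(f"{dtype}: weights {tensor_megabytes(num_params, dtype):.1f} MB, "
          f"one clip {tensor_megabytes(clip_elements, dtype):.2f} MB")
```

The same factor of four applies to every frame buffer that is promoted to FP32 during preprocessing, which is one reason type conversions in the input pipeline are worth tracking.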
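Mixed precision quantization applies different precision levels to different model components. Compute-intensive operations like convolutions benefit most from INT8, while sensitive operations like normalization may require FP16 or FP32. Automatic mixed precision (AMP) in PyTorch addresses a related but distinct problem: it selects between FP32 and FP16 or BF16 per operation at runtime, and it does not apply INT8 quantization, so the INT8 assignment still has to be made through the quantization configuration.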
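One way to decide which components can tolerate INT8 is to measure how much error quantization introduces for each one. The sketch below is a simple heuristic under stated assumptions, not the method used by PyTorch or AMP: it fake-quantizes each layer's weights with symmetric INT8 and keeps a layer in FP16 when the relative error exceeds a tolerance. The layer names, weight values, and the 2% tolerance are all hypothetical.

```python
def quantize_dequantize(values, num_bits=8):
    """Symmetric fake quantization: snap to a signed integer grid, then map back."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def relative_error(values, num_bits=8):
    """L2 norm of the quantization error divided by the L2 norm of the values."""
    approx = quantize_dequantize(values, num_bits)
    num = sum((a - b) ** 2 for a, b in zip(values, approx))
    den = sum(v ** 2 for v in values)
    return (num / den) ** 0.5 if den else 0.0


def choose_precision(layer_weights, tolerance=0.02):
    """Assign INT8 where the error is within tolerance, FP16 otherwise."""
    return {
        name: "int8" if relative_error(weights) <= tolerance else "fp16"
        for name, weights in layer_weights.items()
    }


# Hypothetical layers: evenly spread weights vs. one outlier among tiny values.
layers = {
    "conv1": [-1.0, -0.5, 0.0, 0.5, 1.0],
    "norm1": [1.0] + [0.003] * 1000,
}
plan = choose_precision(layers)
print(plan)
```

The second layer illustrates why some operations are sensitive: a single large value stretches the quantization grid so far that the many small values all round to zero. In practice the decision is usually validated against task accuracy on held-out video rather than weight error alone.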
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Apply dynamic quantization to a video classification model. Compare inference speed and accuracy on a video test set against FP32 baseline. Document any accuracy degradation.
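A small measurement harness can keep the comparison consistent between the two models. The sketch below is one possible starting point, assuming each model is wrapped in a callable that takes a clip and returns a predicted label. The function names and the stand-in predictors are hypothetical and exist only so the harness runs without a real model.

```python
import statistics
import time


def median_latency(predict, clips, warmup=2, repeats=5):
    """Median per-clip wall-clock time in seconds, after a short warm-up."""
    for clip in clips[:warmup]:
        predict(clip)
    timings = []
    for _ in range(repeats):
        for clip in clips:
            start = time.perf_counter()
            predict(clip)
            timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def accuracy(predict, clips, labels):
    """Fraction of clips whose predicted label matches the reference label."""
    correct = sum(1 for clip, label in zip(clips, labels) if predict(clip) == label)
    return correct / len(labels)


def compare(baseline, quantized, clips, labels):
    """Collect latency and accuracy for both predictors in one report."""
    base_acc = accuracy(baseline, clips, labels)
    quant_acc = accuracy(quantized, clips, labels)
    return {
        "baseline_latency_s": median_latency(baseline, clips),
        "quantized_latency_s": median_latency(quantized, clips),
        "baseline_accuracy": base_acc,
        "quantized_accuracy": quant_acc,
        "accuracy_drop": base_acc - quant_acc,
    }


# Stand-in predictors so the sketch runs without a model: integers play the
# role of clips, and the "quantized" predictor gets exactly one clip wrong.
clips = list(range(10))
labels = [clip % 2 for clip in clips]
baseline = lambda clip: clip % 2
quantized = lambda clip: 0 if clip == 3 else clip % 2

report = compare(baseline, quantized, clips, labels)
print(report)
```

When timing on a GPU, the callable should synchronize the device before returning, since CUDA kernels launch asynchronously and the timer would otherwise stop early. Dynamic quantization in PyTorch targets CPU inference, so both models are normally timed on the same CPU.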