16. Performance Optimization

Chapter 16 of 24 · 15 min

Optimizing video multimodal pipelines requires systematic profiling to identify bottlenecks. Optimizing a component that is not the bottleneck wastes effort, so always profile before optimizing.
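As a starting point for profiling, the sketch below accumulates wall-clock time per pipeline stage and ranks the stages by their share of the total. The `StageTimer` class and the stage names are hypothetical, and `time.sleep` stands in for real work. For GPU stages, CUDA kernels launch asynchronously, so a real measurement would need a `torch.cuda.synchronize()` call before each clock read.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (illustrative helper)."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def report(self):
        """Return (name, total_seconds, share_of_total) rows, slowest stage first."""
        grand_total = sum(self.totals.values()) or 1.0
        rows = [(name, t, t / grand_total) for name, t in self.totals.items()]
        return sorted(rows, key=lambda row: row[1], reverse=True)


timer = StageTimer()
for _ in range(5):
    with timer.stage("decode"):
        time.sleep(0.002)   # stand-in for video decoding
    with timer.stage("inference"):
        time.sleep(0.02)    # stand-in for the model forward pass

report = timer.report()
for name, seconds, share in report:
    print(f"{name:10s} {seconds * 1000:8.1f} ms  {share:6.1%}")
```

The stage at the top of the report is the one worth optimizing first; anything with a small share can be left alone.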

Batch processing multiple frames simultaneously improves GPU utilization when frames have no temporal dependencies and can be processed independently. The key constraint: batching increases latency. A batch of 8 frames might achieve a 3x throughput improvement, but the first frame must wait for the other seven to arrive before inference starts, adding up to roughly 8 frame intervals of latency. For streaming systems, batching works only when input buffers can absorb the delay.

import queue
import time

import torch


def batch_inference(model, frame_queue, batch_size=4, timeout=0.033):
    """Collect up to batch_size frames, releasing early once timeout seconds elapse."""
    batch = []
    deadline = time.monotonic() + timeout

    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            # Deadline passed; Queue.get raises ValueError on a negative timeout
            break

        try:
            frame = frame_queue.get(timeout=remaining)
            batch.append(frame)
        except queue.Empty:
            break

    if batch:
        # Frames are assumed to be tensors of identical shape
        batch_tensor = torch.stack(batch).cuda()
        with torch.no_grad():
            outputs = model(batch_tensor)
        return outputs, batch
    return None, []
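The latency cost of batching can be estimated with simple arithmetic before any code is written. The sketch below is a back-of-the-envelope model, not a measurement: the function name and the example numbers (30 fps input, 40 ms per batch) are assumptions chosen for illustration.

```python
def batching_tradeoff(batch_size, fps, batch_infer_ms):
    """Estimate the latency added by batching and the throughput it allows.

    batch_infer_ms is the (assumed, measured elsewhere) time to run one batch.
    """
    frame_interval_ms = 1000.0 / fps
    # The first frame in a batch waits for the remaining frames to arrive.
    fill_wait_ms = (batch_size - 1) * frame_interval_ms
    worst_case_latency_ms = fill_wait_ms + batch_infer_ms
    max_throughput_fps = batch_size / (batch_infer_ms / 1000.0)
    return {
        "fill_wait_ms": fill_wait_ms,
        "worst_case_latency_ms": worst_case_latency_ms,
        "max_throughput_fps": max_throughput_fps,
        # The pipeline keeps up only if it can process frames as fast as they arrive.
        "keeps_up": max_throughput_fps >= fps,
    }


estimate = batching_tradeoff(batch_size=8, fps=30, batch_infer_ms=40.0)
for key, value in estimate.items():
    print(key, value)
```

Under these assumed numbers the first frame waits about 233 ms for the batch to fill, which is why a timeout-based release matters for streaming input.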

CUDA stream parallelism enables overlapping independent operations. Decode, preprocessing, and inference can execute concurrently when properly synchronized. The critical requirements: host-to-device copies must be issued as non-blocking transfers from pinned memory, and streams must be synchronized explicitly wherever one stage consumes another stage's output.
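One way to apply this is to put host-to-device copies on one stream and inference on another, with an event marking when each copy is complete. The sketch below is a minimal illustration under assumptions: `run_overlapped` is a hypothetical helper, inputs are CPU tensors already batched, and decoding is left out. When no GPU is present it falls back to plain sequential CPU inference so it can still be run.

```python
import torch


def run_overlapped(model, cpu_batches):
    """Overlap host-to-device copies with inference using two CUDA streams."""
    if not torch.cuda.is_available():
        # CPU fallback: no streams, just run the batches in order.
        with torch.no_grad():
            return [model(batch) for batch in cpu_batches]

    device = torch.device("cuda")
    model = model.to(device).eval()
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()
    # Model weights were moved on the current stream; wait for that to finish.
    compute_stream.wait_stream(torch.cuda.current_stream())

    outputs = []
    with torch.no_grad():
        for batch in cpu_batches:
            pinned = batch.pin_memory()  # pinned memory is needed for async copies
            with torch.cuda.stream(copy_stream):
                gpu_batch = pinned.to(device, non_blocking=True)
                copied = torch.cuda.Event()
                copied.record(copy_stream)
            # Inference on this batch must not start before its copy completes.
            compute_stream.wait_event(copied)
            with torch.cuda.stream(compute_stream):
                # Tell the caching allocator the tensor is in use on this stream.
                gpu_batch.record_stream(compute_stream)
                outputs.append(model(gpu_batch))

    compute_stream.synchronize()  # explicit sync before results are read
    return [out.cpu() for out in outputs]


model = torch.nn.Linear(4, 2)
batches = [torch.randn(3, 4) for _ in range(2)]
results = run_overlapped(model, batches)
print(len(results), tuple(results[0].shape))
```

Because the Python loop only enqueues work, the copy for the next batch can proceed while the previous batch is still being processed. Whether this yields a measurable gain depends on the ratio of transfer time to compute time, so it should be confirmed with profiling.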

Operator fusion reduces memory traffic by combining sequential operations into a single kernel. In PyTorch, compilers such as torch.compile and TorchScript (torch.jit.script) can fuse chains of element-wise operations, eliminating intermediate tensor allocation, and batch normalization can be folded into a preceding convolution for inference. For custom operators, writing fused CUDA kernels provides maximum control but requires significant development effort.
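A simple, deterministic instance of fusion is folding an inference-time batch normalization into the convolution before it. The sketch below uses `torch.nn.utils.fusion.fuse_conv_bn_eval` for this; note that this is weight folding rather than compiler-generated kernel fusion, and the layer sizes and the `fold_conv_bn` wrapper are illustrative assumptions.

```python
import torch
from torch import nn
from torch.nn.utils.fusion import fuse_conv_bn_eval


def fold_conv_bn(conv, bn):
    """Return a single Conv2d equivalent to bn(conv(x)) in eval mode."""
    conv.eval()
    bn.eval()  # folding is only valid with fixed running statistics
    return fuse_conv_bn_eval(conv, bn)


torch.manual_seed(0)
conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
bn = nn.BatchNorm2d(8)
with torch.no_grad():
    # Give the batch norm non-trivial statistics so the check is meaningful.
    bn.running_mean.uniform_(-1.0, 1.0)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-1.0, 1.0)

fused = fold_conv_bn(conv, bn)

x = torch.randn(2, 3, 16, 16)
with torch.no_grad():
    reference = bn(conv(x))   # two operations, one intermediate tensor
    folded = fused(x)         # one operation, no intermediate tensor

max_abs_diff = (reference - folded).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The folded layer should match the original pair up to floating-point rounding while skipping the intermediate activation entirely.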

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Optimize a video inference pipeline by adding batching and CUDA stream parallelism. Measure throughput improvement and latency increase. Find the optimal batch size for your target use case.
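A possible starting harness for the measurement part of the exercise is sketched below. The function names, the latency budget, and the stand-in workload are all assumptions; the reported latency covers only the inference call and excludes the time spent waiting for a batch to fill. For a GPU model, `infer_fn` would need to call `torch.cuda.synchronize()` before returning so that the timings reflect completed work.

```python
import time


def sweep_batch_sizes(infer_fn, make_batch, batch_sizes, repeats=20, warmup=3):
    """Time infer_fn on batches of different sizes and report per-size statistics."""
    results = []
    for batch_size in batch_sizes:
        batch = make_batch(batch_size)
        for _ in range(warmup):
            infer_fn(batch)  # warm-up runs are not timed
        start = time.perf_counter()
        for _ in range(repeats):
            infer_fn(batch)
        elapsed = max(time.perf_counter() - start, 1e-9)
        results.append({
            "batch_size": batch_size,
            "latency_ms": 1000.0 * elapsed / repeats,
            "frames_per_s": batch_size * repeats / elapsed,
        })
    return results


def pick_batch_size(results, latency_budget_ms):
    """Pick the highest-throughput batch size that stays within the latency budget."""
    within_budget = [r for r in results if r["latency_ms"] <= latency_budget_ms]
    if not within_budget:
        return None
    return max(within_budget, key=lambda r: r["frames_per_s"])["batch_size"]


# Stand-in workload: summing a list takes longer as the "batch" grows.
measured = sweep_batch_sizes(
    infer_fn=lambda batch: sum(batch),
    make_batch=lambda n: list(range(n * 1000)),
    batch_sizes=[1, 2, 4],
    repeats=5,
    warmup=1,
)
for row in measured:
    print(row)
```

Replacing the stand-in workload with the real model call and adding the buffer-fill delay to each latency gives the numbers the exercise asks for.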