Benchmarking Multimodal — Advanced Multi-Modal Systems (Chapter 19)

Systematic benchmarking enables comparison across model variants, hardware configurations, and optimization techniques. Effective benchmarks isolate specific performance characteristics while reflecting real-world usage patterns.

Benchmark design must distinguish between throughput (work completed per unit time) and latency (time to complete a single request). Interactive video streaming is latency-sensitive, so tail latency such as P99 matters more than average throughput. Offline batch processing prioritizes throughput; per-item latency matters far less there, provided the whole job finishes within its deadline.

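import statistics
import time

import torch

def benchmark_inference(model, input_batch, num_iterations=1000, warmup=100):
    """Measure per-call latency and throughput; assumes the model and input live on a CUDA device."""
    # Disable autograd bookkeeping so only the forward pass is timed
    with torch.no_grad():
        # Warmup: let caches, kernel selection, and clocks settle before timing
        for _ in range(warmup):
            model(input_batch)

        latencies = []
        for _ in range(num_iterations):
            # CUDA kernels launch asynchronously, so synchronize on both
            # sides of the timed region to measure completed work
            torch.cuda.synchronize()
            start = time.perf_counter()
            model(input_batch)
            torch.cuda.synchronize()
            latencies.append(time.perf_counter() - start)

    return {
        'mean_ms': statistics.mean(latencies) * 1000,
        'p50_ms': statistics.median(latencies) * 1000,
        # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
        'p95_ms': statistics.quantiles(latencies, n=20)[18] * 1000,
        # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile
        'p99_ms': statistics.quantiles(latencies, n=100)[98] * 1000,
        # Assumes the first dimension of input_batch is the batch size
        'throughput_fps': len(input_batch) / statistics.mean(latencies),
    }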
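
Throughput and latency can move in opposite directions as batch size grows, which is why a benchmark should report both. The sketch below illustrates this with a hypothetical helper, batch_tradeoff, and made-up per-batch latencies chosen only for the arithmetic; none of the numbers are measurements.

```python
def batch_tradeoff(batch_latencies_s):
    """Map batch size -> (latency in ms, throughput in items per second)."""
    return {
        batch_size: (latency_s * 1000, batch_size / latency_s)
        for batch_size, latency_s in batch_latencies_s.items()
    }


# Hypothetical seconds per batch, for illustration only (not measured data)
example_latencies = {1: 0.010, 8: 0.025, 32: 0.080}

tradeoff = batch_tradeoff(example_latencies)
for batch_size, (latency_ms, items_per_s) in tradeoff.items():
    print(f"batch={batch_size:>2}  latency={latency_ms:6.1f} ms  "
          f"throughput={items_per_s:7.1f} items/s")
```

With these illustrative inputs, moving from batch size 1 to 32 raises throughput from 100 to 400 items per second, but every request in the batch now waits 80 ms instead of 10 ms. A streaming workload would likely reject that trade, while an offline job would welcome it.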

Video-specific benchmarks must include temporal input variations. A model that performs well on 16-frame clips may degrade significantly on 128-frame clips. Sweeping sequence length reveals architectural limitations and memory pressure points.
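
A sequence-length sweep can be sketched as a loop over clip lengths that reports latency per frame, so superlinear growth stands out. The names sweep_sequence_length, run_fn, and make_input are hypothetical, and toy_model is a pure-Python stand-in with quadratic cost rather than a real video model. When timing GPU work, run_fn would need to synchronize the device before returning.

```python
import statistics
import time


def sweep_sequence_length(run_fn, make_input, frame_counts, iterations=20, warmup=3):
    """Time run_fn on clips of increasing length.

    run_fn(clip) runs one inference; make_input(n) builds an n-frame clip.
    Both are caller-supplied, hypothetical callables.
    """
    results = []
    for n in frame_counts:
        clip = make_input(n)
        for _ in range(warmup):
            run_fn(clip)
        latencies = []
        for _ in range(iterations):
            start = time.perf_counter()
            run_fn(clip)
            latencies.append(time.perf_counter() - start)
        median_s = statistics.median(latencies)
        results.append({
            'frames': n,
            'median_ms': median_s * 1000,
            # Flat ms_per_frame suggests linear scaling; growth suggests worse
            'ms_per_frame': median_s * 1000 / n,
        })
    return results


def toy_model(clip):
    # Stand-in workload: all-pairs interaction, quadratic in clip length
    return sum(a * b for a in clip for b in clip)


def make_clip(n):
    # Stand-in input: one float per "frame"
    return [float(i) for i in range(n)]


sweep_results = sweep_sequence_length(toy_model, make_clip, [16, 32, 64, 128])
for row in sweep_results:
    print(row)
```

Reporting milliseconds per frame rather than total time makes the scaling behavior easier to read: a roughly constant value indicates linear cost, while a rising value points to an attention-like or memory-bound bottleneck at longer clips.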

Hardware-in-the-loop benchmarking captures power consumption, thermal throttling, and memory bandwidth saturation that pure software benchmarks miss. Extended runs (30+ minutes) matter because throttling typically appears only once the device has reached a sustained operating temperature, and a short benchmark may end before that point.
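
One software-side signal of throttling during a long run is latency drift: the same workload gets slower over time. The sketch below compares the median latency of the first and last portions of a time-ordered sample list. The helper latency_drift and its window_fraction parameter are hypothetical, and the sample data is synthetic, constructed to show the calculation. It does not measure power or temperature, which require platform-specific tooling.

```python
import statistics


def latency_drift(latencies, window_fraction=0.1):
    """Compare median latency at the start and end of a time-ordered run."""
    if not latencies:
        raise ValueError("latencies must not be empty")
    if not 0 < window_fraction <= 0.5:
        raise ValueError("window_fraction must be in (0, 0.5]")
    window = max(1, int(len(latencies) * window_fraction))
    early = statistics.median(latencies[:window])
    late = statistics.median(latencies[-window:])
    return {
        'early_ms': early * 1000,
        'late_ms': late * 1000,
        # Values well above 1.0 mean the run slowed down over time
        'slowdown': late / early,
    }


# Synthetic latencies in seconds (not measured): steady, then 30% slower
synthetic = [0.010] * 900 + [0.013] * 100

drift = latency_drift(synthetic)
print(drift)
```

A slowdown ratio near 1.0 suggests stable clocks over the run. A ratio that climbs as the run gets longer is worth correlating with device temperature and clock-frequency readings before attributing it to throttling.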

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
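
Recording the environment can be automated with the standard library. The sketch below assumes Python 3.8 or later; describe_environment is a hypothetical helper, and the default package list is only an example to adjust for the exercise at hand. Packages that are not installed are recorded as None rather than raising an error.

```python
import json
import platform
from importlib import metadata


def describe_environment(packages=('torch',)):
    """Collect interpreter, OS, and package versions for a benchmark log."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        'python': platform.python_version(),
        'platform': platform.platform(),
        'machine': platform.machine(),
        'packages': versions,
    }


env_record = describe_environment()
print(json.dumps(env_record, indent=2))
```

Saving this record next to each set of benchmark results keeps the numbers tied to the software and hardware that produced them.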