Kernel Benchmarking — Custom Quantization and Kernels (Chapter 14)

Systematic benchmarking reveals performance characteristics and guides optimization decisions. Environment variables and measurement methodology significantly impact results.

Benchmark Infrastructure

import torch
import numpy as np
import time

def benchmark_kernel(func, *args, warmup=50, iterations=1000, 
                     synchronize=True, use_cuda=True):
    """Measure kernel execution time with statistical reporting."""
    
    if synchronize and use_cuda:
        torch.cuda.synchronize()
    
    # Warmup phase
    for _ in range(warmup):
        func(*args)
    
    if synchronize and use_cuda:
        torch.cuda.synchronize()
    
    # Timed phase
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        func(*args)
        if synchronize and use_cuda:
            torch.cuda.synchronize()
        end = time.perf_counter()
        times.append((end - start) * 1e3)  # Convert to ms
    
    return {
        'mean': np.mean(times),
        'std': np.std(times),
        'min': np.min(times),
        'max': np.max(times),
        'p50': np.percentile(times, 50),
        'p99': np.percentile(times, 99)
    }

def benchmark_throughput(func, shape, dtype, device='cuda:0'):
    """Measure arithmetic throughput in TOPS."""
    M, N, K = shape
    
    # FLOP count for GEMM: 2 * M * N * K
    flops = 2 * M * N * K
    
    result = benchmark_kernel(func)
    time_us = result['mean'] * 1e3
    
    tops = flops / (time_us * 1e-6) / 1e12
    return tops

# Example usage
shapes = [(1024, 1024, 1024), (2048, 2048, 2048), (4096, 4096, 4096)]
for shape in shapes:
    M, N, K = shape
    times = benchmark_kernel(gemm_kernel, M, N, K, ...)
    print(f"Shape {shape}: {times['mean']:.2f} ms ± {times['std']:.2f}")

Throughput and Latency Profiles

def profile_kernel_memory(ncu_path, kernel_binary, args):
    """Profile memory bandwidth utilization with ncu."""
    cmd = [
        ncu_path, '--metrics', 'dram_throughput,sm_throughput',
        '--set', 'full', kernel_binary, *args
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_ncu_output(result.stdout)

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.