14. Kernel Benchmarking
Chapter 14 of 18 · 15 min
Systematic benchmarking reveals a kernel's performance characteristics and guides optimization decisions. Results depend heavily on environment variables and measurement methodology, so both should be held fixed and recorded with every measurement.
Benchmark Infrastructure
import time

import numpy as np
import torch


def benchmark_kernel(func, *args, warmup=50, iterations=1000,
                     synchronize=True, use_cuda=True):
    """Measure kernel execution time with statistical reporting."""
    if synchronize and use_cuda:
        torch.cuda.synchronize()
    # Warmup phase
    for _ in range(warmup):
        func(*args)
    if synchronize and use_cuda:
        torch.cuda.synchronize()
    # Timed phase
    times = []
    for _ in range(iterations):
        start = time.perf_counter()
        func(*args)
        if synchronize and use_cuda:
            torch.cuda.synchronize()
        end = time.perf_counter()
        times.append((end - start) * 1e3)  # Convert to ms
    return {
        'mean': np.mean(times),
        'std': np.std(times),
        'min': np.min(times),
        'max': np.max(times),
        'p50': np.percentile(times, 50),
        'p99': np.percentile(times, 99)
    }
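

def benchmark_throughput(func, shape, dtype, device='cuda:0'):
    """Measure arithmetic throughput in TOPS."""
    M, N, K = shape
    # Operation count for GEMM: 2 * M * N * K (one multiply and one add per term)
    flops = 2 * M * N * K
    # Assumption: func takes the two GEMM operands, A (M x K) and B (K x N).
    a = torch.ones((M, K), dtype=dtype, device=device)
    b = torch.ones((K, N), dtype=dtype, device=device)
    result = benchmark_kernel(func, a, b,
                              use_cuda=str(device).startswith('cuda'))
    time_s = result['mean'] * 1e-3  # benchmark_kernel reports milliseconds
    tops = flops / time_s / 1e12
    return tops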


# Example usage (gemm_kernel is the kernel under test, defined elsewhere)
shapes = [(1024, 1024, 1024), (2048, 2048, 2048), (4096, 4096, 4096)]
for shape in shapes:
    M, N, K = shape
    times = benchmark_kernel(gemm_kernel, M, N, K, ...)
    print(f"Shape {shape}: {times['mean']:.2f} ms ± {times['std']:.2f} ms")
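Host-side timers such as time.perf_counter include kernel launch and synchronization overhead, which can dominate for very short kernels. One common alternative is to time on the device with CUDA events. The following is a minimal sketch under stated assumptions, not a replacement for benchmark_kernel: the function name time_with_cuda_events is hypothetical, and the fallback to a host timer when PyTorch or a GPU is missing exists only so the sketch runs anywhere.

```python
import time

try:
    import torch
except ImportError:  # lets the sketch run on machines without PyTorch
    torch = None


def time_with_cuda_events(func, *args, warmup=10, iterations=100):
    """Return per-iteration times in ms, using CUDA events when a GPU is available."""
    use_events = torch is not None and torch.cuda.is_available()
    for _ in range(warmup):
        func(*args)
    times_ms = []
    if use_events:
        torch.cuda.synchronize()  # drain warmup work before timing
        for _ in range(iterations):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            func(*args)
            end.record()
            end.synchronize()  # wait until the end event has completed
            times_ms.append(start.elapsed_time(end))  # milliseconds
    else:
        # Fallback: host wall-clock timing, as in benchmark_kernel.
        for _ in range(iterations):
            t0 = time.perf_counter()
            func(*args)
            times_ms.append((time.perf_counter() - t0) * 1e3)
    return times_ms
```

The returned list can be summarized with the same statistics (mean, p50, p99) used above, which makes host-timed and event-timed numbers directly comparable.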
Throughput and Latency Profiles
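import subprocess


def profile_kernel_memory(ncu_path, kernel_binary, args):
    """Profile memory bandwidth utilization with ncu."""
    # Nsight Compute metric names use the unit__metric.rollup form.
    cmd = [
        ncu_path, '--metrics',
        'dram__throughput.avg.pct_of_peak_sustained_elapsed,'
        'sm__throughput.avg.pct_of_peak_sustained_elapsed',
        '--set', 'full', kernel_binary, *args
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    # parse_ncu_output is a helper defined elsewhere; it is not part of ncu.
    return parse_ncu_output(result.stdout)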
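The chapter does not show how the profiler output is parsed. As a rough illustration only, the sketch below assumes ncu is run with its --csv option added to the command and that the output has "Kernel Name", "Metric Name", and "Metric Value" columns. The function name parse_ncu_csv is hypothetical, and the column names should be checked against the output of the locally installed Nsight Compute version.

```python
import csv
import io


def parse_ncu_csv(stdout):
    """Collect {kernel name: {metric name: value}} from CSV-formatted ncu output."""
    lines = stdout.splitlines()
    # Skip application output and "==PROF==" banners that precede the CSV header.
    header_idx = next(
        (i for i, line in enumerate(lines) if '"Metric Name"' in line), None)
    if header_idx is None:
        return {}
    reader = csv.DictReader(io.StringIO("\n".join(lines[header_idx:])))
    metrics = {}
    for row in reader:
        # Assumed: numeric values may carry thousands separators.
        raw = (row.get("Metric Value") or "").replace(",", "")
        try:
            value = float(raw)
        except ValueError:
            continue  # non-numeric metrics are ignored in this sketch
        kernel = row.get("Kernel Name") or "?"
        metrics.setdefault(kernel, {})[row["Metric Name"]] = value
    return metrics
```

Returning a plain nested dictionary keeps the result easy to merge with the timing statistics from benchmark_kernel.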
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on problem shape, data type, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
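One way to capture this information is to store a small environment snapshot next to each set of results. The sketch below is illustrative: the function name environment_snapshot and the default list of environment variables are assumptions, and the PyTorch fields are filled in only when PyTorch is installed.

```python
import json
import os
import platform
import sys


def environment_snapshot(env_vars=("CUDA_VISIBLE_DEVICES", "CUDA_LAUNCH_BLOCKING")):
    """Collect version and environment details worth storing with benchmark results."""
    snap = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        # Unset variables are recorded as None so their absence is explicit.
        "env": {name: os.environ.get(name) for name in env_vars},
    }
    try:
        import torch
        snap["torch"] = str(torch.__version__)
        snap["cuda"] = torch.version.cuda  # None for CPU-only builds
        if torch.cuda.is_available():
            snap["gpu"] = torch.cuda.get_device_name(0)
    except ImportError:
        snap["torch"] = None
    return snap


print(json.dumps(environment_snapshot(), indent=2))
```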
EXERCISE
Profile your quantized GEMM kernel and compare achieved memory bandwidth to device theoretical peak. Identify the bottleneck (compute-bound vs memory-bound).
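A simple roofline-style check can help with this exercise. The sketch below is a simplified model under stated assumptions: every matrix element has the same size, each matrix crosses DRAM exactly once, and the peak figures come from the device datasheet. The function name gemm_roofline and the peak numbers in the usage lines are placeholders, not measurements of any particular GPU.

```python
def gemm_roofline(M, N, K, time_ms, bytes_per_element, peak_tops, peak_bw_gbs):
    """Compare achieved throughput and bandwidth against device peaks."""
    ops = 2 * M * N * K
    # Minimum traffic: read A and B once, write C once (ignores cache effects).
    bytes_moved = (M * K + K * N + M * N) * bytes_per_element
    time_s = time_ms * 1e-3
    intensity = ops / bytes_moved  # operations per byte
    # Ridge point: intensity at which compute and memory limits meet.
    ridge = (peak_tops * 1e12) / (peak_bw_gbs * 1e9)
    return {
        "achieved_tops": ops / time_s / 1e12,
        "achieved_bw_gbs": bytes_moved / time_s / 1e9,
        "intensity": intensity,
        "ridge": ridge,
        "bound": "compute-bound" if intensity >= ridge else "memory-bound",
    }


# Placeholder peaks (500 TOPS, 1000 GB/s); substitute the device's datasheet values.
square = gemm_roofline(4096, 4096, 4096, time_ms=1.0, bytes_per_element=1,
                       peak_tops=500, peak_bw_gbs=1000)
skinny = gemm_roofline(1, 4096, 4096, time_ms=0.05, bytes_per_element=1,
                       peak_tops=500, peak_bw_gbs=1000)
print(square["bound"], skinny["bound"])
```

In this model, large square shapes sit above the ridge point while skinny shapes with M = 1 fall below it, which is why a single kernel can be compute-bound for one shape and memory-bound for another. For quantized kernels the output element size often differs from the input size, so the byte count should be adjusted accordingly.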