Cluster Benchmarking — Local AI Clusters (Chapter 17)

Systematic benchmarking quantifies throughput, latency, and resource efficiency, enabling comparison across configurations and identifying bottlenecks.

Benchmarking Tools

For LLM inference, llama.cpp provides llama-bench:

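# Build the benchmarking tool (run from the llama.cpp source root).
# Add your GPU backend flag if needed, e.g. -DGGML_CUDA=ON for NVIDIA GPUs.
cmake -B build
cmake --build build --config Release --target llama-bench

# Run the benchmark with a specific model:
#   -ngl 99  offload all layers to the GPU
#   -t 8     use 8 CPU threads
#   -p 512   prompt-processing test with 512 tokens
#   -n 2048  text-generation test with 2048 tokens
./build/bin/llama-bench -m models/llama-3-8b.Q4_K_M.gguf \
  -ngl 99 \
  -t 8 \
  -p 512 \
  -n 2048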

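The output reports tokens per second separately for the prompt-processing and text-generation tests. Passing comma-separated values to -p, -n, or -b (batch size) benchmarks several prompt lengths, generation lengths, or batch sizes in one run.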
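
To compare configurations programmatically, llama-bench can emit machine-readable results with -o json. The sketch below assumes each result object carries the fields n_prompt, n_gen, and avg_ts (average tokens per second); check these names against the output of your build. The sample string is a made-up placeholder, not a measurement.

```python
import json


def summarize_llama_bench(json_text):
    """Return (test_label, tokens_per_second) pairs from llama-bench JSON output."""
    rows = []
    for entry in json.loads(json_text):
        n_prompt = entry.get("n_prompt", 0)
        n_gen = entry.get("n_gen", 0)
        # Prompt-processing tests have n_gen == 0, text-generation tests have
        # n_prompt == 0; anything else is labeled as a combined test.
        if n_gen == 0:
            label = f"pp{n_prompt}"
        elif n_prompt == 0:
            label = f"tg{n_gen}"
        else:
            label = f"pp{n_prompt}+tg{n_gen}"
        rows.append((label, entry["avg_ts"]))
    return rows


# Hypothetical placeholder values, NOT real benchmark results.
sample = (
    '[{"n_prompt": 512, "n_gen": 0, "avg_ts": 1000.0},'
    ' {"n_prompt": 0, "n_gen": 2048, "avg_ts": 50.0}]'
)

for label, tps in summarize_llama_bench(sample):
    print(f"{label}: {tps:.1f} tokens/s")
```

In practice the input would come from something like ./build/bin/llama-bench ... -o json > results.json, read back with open("results.json").read().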

Network Performance Testing

Throughput to network-backed shared storage such as NFS or CephFS can limit multi-node training efficiency:

# Test NFS/CephFS throughput
dd if=/dev/zero of=/shared/testfile bs=1M count=1024 oflag=direct
sync && rm /shared/testfile

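# Test with fio for a more realistic workload: 4 parallel writers, each with
# its own 1 GiB file, bypassing the page cache, for a fixed 60 seconds
fio --name=seq-write --directory=/shared --rw=write \
  --bs=1m --size=1g --numjobs=4 --direct=1 \
  --time_based --runtime=60 --group_reporting
rm -f /shared/seq-write.*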

Target throughput depends on checkpoint size and on how long a save may stall training. As a baseline, checkpoint saves during training need sustained write bandwidth of at least 500 MB/s.
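
The relationship between checkpoint size, bandwidth, and stall time is simple arithmetic, sketched below. The function names and the example figures (8 billion parameters, 2 bytes per parameter, a 10-second stall budget) are illustrative assumptions. Real training checkpoints usually also hold optimizer state, so the effective bytes per parameter is often several times higher.

```python
def checkpoint_save_seconds(num_params, bytes_per_param, bandwidth_mb_per_s):
    """Time to write one checkpoint at a sustained bandwidth (MB = 1e6 bytes)."""
    size_bytes = num_params * bytes_per_param
    return size_bytes / (bandwidth_mb_per_s * 1e6)


def required_bandwidth_mb_per_s(num_params, bytes_per_param, max_stall_seconds):
    """Sustained bandwidth needed to finish a save within the stall budget."""
    size_bytes = num_params * bytes_per_param
    return size_bytes / max_stall_seconds / 1e6


# Assumed example: 8e9 parameters stored as 16-bit weights only (2 bytes each).
print(checkpoint_save_seconds(8e9, 2, 500))      # 32.0 seconds at 500 MB/s
print(required_bandwidth_mb_per_s(8e9, 2, 10))   # 1600.0 MB/s for a 10 s stall
```

At the 500 MB/s baseline, a 16 GB weights-only checkpoint takes about half a minute to write. Whether that is acceptable depends on how often checkpoints are saved and whether the save blocks training.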

GPU Benchmark Suite

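NVIDIA's Data Center GPU Manager (DCGM) exposes GPU profiling metrics through its dcgmi command-line tool:

# Install DCGM (needs NVIDIA's apt repository; the package name varies by DCGM release)
apt-get install -y datacenter-gpu-manager
systemctl --now enable nvidia-dcgm

# List the profiling metrics that GPU 0 supports
dcgmi profile --list -i 0

# Sample profiling fields once per second while the training workload runs:
# 1002 SM activity, 1003 SM occupancy, 1004 tensor pipe activity, 1005 DRAM activity
dcgmi dmon -i 0 -e 1002,1003,1004,1005 -d 1000 > profile_output.txt &
DMON_PID=$!
./train.py --config config.yaml
kill "$DMON_PID"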

Key metrics include SM occupancy, memory bandwidth utilization, and tensor core efficiency.

Latency Distribution Analysis

Collect end-to-end latency histograms:

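# Collect latency samples from inference service
curl -s http://inference-service:8080/metrics | grep inference_request_seconds

# Calculate percentiles (assumes a single inference_request_seconds histogram series)
python3 <<'EOF'
import re
import urllib.request

# Fetch the Prometheus text exposition from the same service queried above
text = urllib.request.urlopen('http://inference-service:8080/metrics').read().decode()
# Collect cumulative (upper bound, count) pairs from the histogram buckets
pat = r'inference_request_seconds_bucket\{[^}]*le="([^"]+)"[^}]*\}\s+(\S+)'
buckets = sorted((float(le), float(n)) for le, n in re.findall(pat, text))
total = buckets[-1][1]
for q in (0.50, 0.90, 0.99, 0.999):
    # Report the upper bound of the first bucket that covers quantile q
    bound = next(le for le, n in buckets if n >= q * total)
    print(f'P{q * 100:g} <= {bound}s')
EOF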

Tail latency (P99+) matters for user-facing inference where occasional slow responses degrade perceived quality.
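
Histogram buckets only bound each percentile, so the estimate is as coarse as the bucket layout. A finer estimate interpolates linearly inside the bucket that contains the target rank, similar in spirit to PromQL's histogram_quantile(). The sketch below is a simplified illustration rather than a reimplementation of Prometheus. The function name and the example bucket counts are hypothetical.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q (0..1) from cumulative (upper_bound, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Assume observations are spread evenly across the bucket.
            fraction = (rank - prev_count) / in_bucket
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts (upper bound in seconds, count).
example = [(0.25, 10), (0.5, 60), (1.0, 95), (2.0, 100), (math.inf, 100)]

for q in (0.50, 0.90, 0.99):
    print(f"P{q * 100:g} ~ {histogram_quantile(q, example):.3f}s")
```

The even-spread assumption means accuracy still depends on bucket boundaries. For tail percentiles, place extra buckets around the latency SLA so that P99 does not land in a very wide bucket.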

Throughput vs Latency Tradeoff

Larger batches raise throughput but also raise per-request latency:

Batch Size	Requests/Second	Avg Latency	P99 Latency
1	2.3	435ms	520ms
4	8.1	493ms	680ms
16	18.4	871ms	1200ms
64	29.2	2193ms	3100ms

Tune batch size against the workload's SLA: interactive workloads favor low latency, while batch processing prioritizes throughput.
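
One way to turn the table into a tuning rule is to pick the highest-throughput batch size whose P99 latency still fits the SLA. The sketch below uses the measurements from the table above. The function name and the latency budgets are hypothetical.

```python
# (batch_size, requests_per_second, avg_latency_ms, p99_latency_ms)
MEASUREMENTS = [
    (1, 2.3, 435, 520),
    (4, 8.1, 493, 680),
    (16, 18.4, 871, 1200),
    (64, 29.2, 2193, 3100),
]


def pick_batch_size(measurements, p99_budget_ms):
    """Return the batch size with the highest throughput within the P99 budget."""
    eligible = [m for m in measurements if m[3] <= p99_budget_ms]
    if not eligible:
        return None  # no configuration meets the SLA
    return max(eligible, key=lambda m: m[1])[0]


print(pick_batch_size(MEASUREMENTS, 1000))  # 4: interactive budget of 1 s
print(pick_batch_size(MEASUREMENTS, 5000))  # 64: throughput-oriented budget
print(pick_batch_size(MEASUREMENTS, 500))   # None: even batch size 1 is too slow
```

With a 1-second P99 budget the rule selects batch size 4, giving up more than half of the throughput available at batch size 16 in exchange for meeting the latency target.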