17. Cluster Benchmarking
Systematic benchmarking quantifies throughput, latency, and resource efficiency, which makes it possible to compare configurations and identify bottlenecks.
Benchmarking Tools
For LLM inference, llama.cpp provides llama-bench:
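# Build the benchmarking tool from a build/ directory inside the llama.cpp checkout.
# Add your backend flag if needed, for example -DGGML_CUDA=ON for NVIDIA GPUs.
cmake .. -DCMAKE_BUILD_TYPE=Release
make llama-bench
# Run the benchmark against a specific model:
# -ngl = layers offloaded to the GPU, -t = CPU threads,
# -p = prompt tokens to process, -n = tokens to generate
./bin/llama-bench -m models/llama-3-8b.Q4_K_M.gguf \
  -ngl 99 \
  -t 8 \
  -p 512 \
  -n 2048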
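The output reports tokens per second separately for prompt processing and text generation. To compare settings, pass comma-separated values to a flag (for example -b 128,256,512 for batch size or -p 512,2048 for prompt length) and llama-bench runs each combination in turn.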
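When many configurations are benchmarked, it helps to collect the results in machine-readable form. The sketch below assumes llama-bench was run with -o json and that each result object carries the fields n_prompt, n_gen, n_batch, and avg_ts (average tokens per second); these field names reflect recent llama.cpp builds and should be checked against your version. The sample values are placeholders for illustration, not measurements.

```python
import json

def summarize_llama_bench(json_text):
    """Return (test_label, n_batch, avg_tokens_per_second) for each result."""
    rows = []
    for entry in json.loads(json_text):
        # Assumed layout: prompt-processing tests have n_gen == 0,
        # text-generation tests have n_prompt == 0.
        if entry.get("n_gen", 0) > 0 and entry.get("n_prompt", 0) == 0:
            label = f"tg{entry['n_gen']}"
        else:
            label = f"pp{entry['n_prompt']}"
        rows.append((label, entry.get("n_batch"), entry["avg_ts"]))
    return rows

# Placeholder input shaped like `llama-bench -o json` output (not real data).
SAMPLE = """[
  {"n_batch": 512, "n_prompt": 512, "n_gen": 0, "avg_ts": 1000.0},
  {"n_batch": 512, "n_prompt": 0, "n_gen": 128, "avg_ts": 50.0}
]"""

for label, n_batch, tps in summarize_llama_bench(SAMPLE):
    print(f"{label:>8}  batch={n_batch}  {tps:.1f} tok/s")
```

In practice the JSON would come from a saved run, for example by redirecting the benchmark output to a file and reading that file into summarize_llama_bench.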
Network Performance Testing
Shared storage bandwidth limits multi-node training efficiency:
# Test NFS/CephFS throughput
dd if=/dev/zero of=/shared/testfile bs=1M count=1024 oflag=direct
sync && rm /shared/testfile
# Test with fio for a more realistic workload (4 parallel writers, page cache bypassed)
fio --name=seq-write --filename=/shared/fiotest --rw=write \
  --bs=1m --size=1g --numjobs=4 --runtime=60 \
  --direct=1 --group_reporting
rm /shared/fiotest
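Target throughput depends on model size and checkpoint frequency, because every save writes the full training state to shared storage. As a baseline, checkpoint saves during training need sustained write bandwidth of at least 500 MB/s.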
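To see what a given bandwidth means in practice, the arithmetic can be sketched as follows. The function name and the bytes-per-parameter figures are illustrative assumptions: 2 bytes per parameter covers fp16/bf16 weights only, while roughly 14 bytes per parameter is one common layout when fp32 master weights and Adam optimizer state are saved too.

```python
def checkpoint_write_seconds(params_billion, bytes_per_param, bandwidth_mb_s):
    """Estimate the time to write one checkpoint at a sustained bandwidth."""
    size_mb = params_billion * 1e9 * bytes_per_param / 1e6
    return size_mb / bandwidth_mb_s

# 8B-parameter model, weights only (assumed 2 bytes per parameter), at 500 MB/s
print(checkpoint_write_seconds(8, 2, 500))   # 32.0

# Same model with optimizer state (assumed 14 bytes per parameter)
print(checkpoint_write_seconds(8, 14, 500))  # 224.0
```

If training stalls while the checkpoint is written, that time is lost on every save, which is why the required bandwidth grows with model size.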
GPU Benchmark Suite
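NVIDIA's Data Center GPU Manager (DCGM) exposes GPU profiling metrics through its dcgmi command-line tool:
# Install DCGM and start its host engine service (package name may vary by DCGM release)
apt-get install -y datacenter-gpu-manager
systemctl --now enable nvidia-dcgm
# Sample profiling fields on GPU 0 once per second for 60 seconds while training runs:
# 1003 = SM occupancy, 1004 = tensor pipe activity, 1005 = DRAM activity
dcgmi dmon -i 0 -e 1003,1004,1005 -d 1000 -c 60 > profile_output.txt &
./train.py --config config.yaml
# Wait for the sampler to finish writing the report
wait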
Key metrics include SM occupancy, memory bandwidth utilization, and tensor core efficiency.
Latency Distribution Analysis
Collect end-to-end latency histograms:
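# Collect latency samples from inference service
curl -s http://inference-service:8080/metrics | grep inference_request_seconds
# Calculate percentiles
python3 <<'EOF'
import subprocess
# Fetch the same metrics endpoint queried above
result = subprocess.run(['curl', '-s', 'http://inference-service:8080/metrics'],
                        capture_output=True, text=True, check=True)
# Keep the cumulative histogram bucket lines; P50, P90, P99, and P99.9
# are estimated from these counts
buckets = [line for line in result.stdout.splitlines()
           if line.startswith('inference_request_seconds_bucket')]
print('\n'.join(buckets))
EOF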
Tail latency (P99+) matters for user-facing inference where occasional slow responses degrade perceived quality.
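Percentiles cannot be read directly from a Prometheus histogram; they have to be estimated from the cumulative bucket counts. The following is a minimal sketch of that estimation using linear interpolation inside the bucket that contains the target rank, similar in spirit to PromQL's histogram_quantile. It assumes the metric is named inference_request_seconds, that it is exported as a single series (no extra label combinations), and that the sample text below, which is a placeholder rather than real data, resembles what the service returns.

```python
import re

# Matches lines such as: name_bucket{le="0.5"} 60  (optional trailing timestamp)
BUCKET_RE = re.compile(r'^(\w+)_bucket\{(.*)\}\s+(\S+)(?:\s+-?\d+)?$')
LE_RE = re.compile(r'le="([^"]+)"')

def parse_buckets(metrics_text, metric):
    """Return sorted (upper_bound, cumulative_count) pairs for one histogram."""
    buckets = []
    for line in metrics_text.splitlines():
        m = BUCKET_RE.match(line.strip())
        if not m or m.group(1) != metric:
            continue
        le = LE_RE.search(m.group(2))
        if le:
            # float() accepts "+Inf", so the overflow bucket sorts last
            buckets.append((float(le.group(1)), float(m.group(3))))
    return sorted(buckets)

def quantile(q, buckets):
    """Estimate the q-quantile by interpolating within the target bucket."""
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Placeholder metrics text for illustration only
SAMPLE = """
inference_request_seconds_bucket{le="0.25"} 0
inference_request_seconds_bucket{le="0.5"} 60
inference_request_seconds_bucket{le="1.0"} 95
inference_request_seconds_bucket{le="2.0"} 100
inference_request_seconds_bucket{le="+Inf"} 100
"""

buckets = parse_buckets(SAMPLE, "inference_request_seconds")
for q in (0.5, 0.9, 0.99, 0.999):
    print(f"P{q * 100:g}: {quantile(q, buckets):.3f} s")
```

The accuracy of the estimate is bounded by the bucket widths, so buckets should be dense around the latency range the SLA cares about.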
Throughput vs Latency Tradeoff
Batching raises throughput at the cost of higher per-request latency:
| Batch Size | Requests/Second | Avg Latency | P99 Latency |
|---|---|---|---|
| 1 | 2.3 | 435ms | 520ms |
| 4 | 8.1 | 493ms | 680ms |
| 16 | 18.4 | 871ms | 1200ms |
| 64 | 29.2 | 2193ms | 3100ms |
Tune batch size to the workload's SLA: interactive workloads need low latency, while batch processing prioritizes throughput.
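One way to turn this rule into a decision is to pick the highest-throughput batch size whose P99 latency still fits the latency budget. The sketch below uses the figures from the table above; the function name and the budget values are hypothetical.

```python
# (batch_size, requests_per_second, avg_latency_ms, p99_latency_ms)
MEASUREMENTS = [
    (1, 2.3, 435, 520),
    (4, 8.1, 493, 680),
    (16, 18.4, 871, 1200),
    (64, 29.2, 2193, 3100),
]

def best_batch_size(measurements, p99_budget_ms):
    """Return the highest-throughput batch size within the P99 budget, or None."""
    eligible = [m for m in measurements if m[3] <= p99_budget_ms]
    if not eligible:
        return None
    return max(eligible, key=lambda m: m[1])[0]

print(best_batch_size(MEASUREMENTS, 1000))  # 4: interactive budget of 1 s
print(best_batch_size(MEASUREMENTS, 5000))  # 64: relaxed budget for batch jobs
print(best_batch_size(MEASUREMENTS, 100))   # None: no configuration qualifies
```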
Run llama-bench with the same model at batch sizes 1, 4, and 16, record the throughput and latency results, then calculate the cost-per-token for each configuration at your cluster's electricity rate.
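The cost calculation can be sketched as below. It counts electricity only and assumes a steady power draw during generation; the function name and the example figures (50 tokens per second, 300 W, 0.15 per kWh) are placeholders to be replaced with your own measurements and tariff.

```python
def cost_per_million_tokens(tokens_per_second, power_watts, price_per_kwh):
    """Electricity cost of generating one million tokens at a steady rate."""
    seconds = 1_000_000 / tokens_per_second
    kwh = power_watts * seconds / 3_600_000  # watt-seconds to kilowatt-hours
    return kwh * price_per_kwh

# Placeholder inputs, not measured values
cost = cost_per_million_tokens(tokens_per_second=50, power_watts=300, price_per_kwh=0.15)
print(f"{cost:.2f} per million tokens")  # 0.25 per million tokens
```

Dividing the result by one million gives the cost per token. Average power draw can be read from nvidia-smi or from a metered PDU while the benchmark runs.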