RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Local AI Clusters

17. Cluster Benchmarking

Chapter 17 of 18 · 20 min
KEY INSIGHT

Benchmarking reveals that naive single-request serving wastes GPU capacity. The latency-throughput tradeoff is not linear: each increase in batch size yields a smaller throughput gain while adding a larger latency penalty. Optimal configurations target specific SLAs rather than maximizing either metric in isolation.

Systematic benchmarking quantifies throughput, latency, and resource efficiency, enabling comparison across configurations and identifying bottlenecks.

Benchmarking Tools

For LLM inference, llama.cpp provides llama-bench:

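# Build the benchmarking tool (run from a build directory inside the llama.cpp checkout).
# Backend options vary by llama.cpp version and platform; add the one for your GPU.
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . --target llama-bench

# Run benchmark with a specific model:
# offload all layers (-ngl 99), 8 CPU threads, 512-token prompt, 2048 generated tokens
./bin/llama-bench -m models/llama-3-8b.Q4_K_M.gguf \
  -ngl 99 \
  -t 8 \
  -p 512 \
  -n 2048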

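Output reports tokens per second for each test: prompt processing (pp) and token generation (tg). To cover several batch sizes or prompt lengths in one run, pass comma-separated values, for example -b 128,256,512.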
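
To compare runs programmatically, the results can be captured in a machine-readable form. The sketch below assumes llama-bench was run with -o json and that each result object carries n_prompt, n_gen, and avg_ts (average tokens per second) fields. Field names may differ between llama.cpp versions, and the sample values are invented for illustration, not measurements.

```python
import json

# Hypothetical sample shaped like `llama-bench -o json` output.
# Field names are assumptions; the numbers are illustrative, not measurements.
SAMPLE = """
[
  {"n_prompt": 512, "n_gen": 0, "n_batch": 2048, "avg_ts": 1000.0},
  {"n_prompt": 0, "n_gen": 2048, "n_batch": 2048, "avg_ts": 50.0}
]
"""

def summarize(raw: str) -> dict:
    """Map a test label (pp<N> or tg<N>) to its average tokens per second."""
    summary = {}
    for row in json.loads(raw):
        if row["n_prompt"] > 0 and row["n_gen"] == 0:
            label = f"pp{row['n_prompt']}"   # prompt processing test
        elif row["n_gen"] > 0 and row["n_prompt"] == 0:
            label = f"tg{row['n_gen']}"      # token generation test
        else:
            label = f"pp{row['n_prompt']}+tg{row['n_gen']}"
        summary[label] = row["avg_ts"]
    return summary

results = summarize(SAMPLE)
for label, tps in results.items():
    print(f"{label}: {tps:.1f} tokens/s")
```

In practice, SAMPLE would be replaced by the contents of a file written from the benchmark's standard output.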

Network Performance Testing

Shared storage bandwidth limits multi-node training efficiency:

# Test NFS/CephFS throughput
dd if=/dev/zero of=/shared/testfile bs=1M count=1024 oflag=direct
sync && rm /shared/testfile

# Test with fio for a more realistic workload (direct I/O bypasses the page cache)
fio --name=seq-write --filename=/shared/fiotest --rw=write \
  --bs=1m --size=1g --numjobs=4 --runtime=60 \
  --direct=1 --group_reporting
rm /shared/fiotest

Target throughput depends on model size, since larger checkpoints take proportionally longer to write. As a baseline, checkpoint saves during training need sustained write bandwidth of at least 500 MB/s.
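
To see what that bandwidth floor means in practice, the time to write a checkpoint is simply its size divided by sustained bandwidth. A minimal sketch follows; the checkpoint sizes are hypothetical examples, and decimal units (1 GB = 1000 MB) are assumed.

```python
def checkpoint_save_seconds(checkpoint_gb: float, bandwidth_mb_s: float) -> float:
    """Seconds to write a checkpoint at a sustained bandwidth (1 GB = 1000 MB)."""
    if bandwidth_mb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return checkpoint_gb * 1000.0 / bandwidth_mb_s

# Hypothetical checkpoint sizes in GB; 500 MB/s is the floor quoted above.
for size_gb in (16, 140):
    secs = checkpoint_save_seconds(size_gb, 500)
    print(f"{size_gb} GB checkpoint at 500 MB/s: {secs:.0f} s")
```

If training pauses while a checkpoint is written, this time is pure overhead per save, so it is worth comparing against the checkpoint interval.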

GPU Benchmark Suite

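NVIDIA's Data Center GPU Manager (DCGM) exposes GPU profiling metrics through the dcgmi command-line tool:

# Install DCGM (package name may differ by DCGM version and distribution)
apt-get install -y datacenter-gpu-manager
systemctl start nvidia-dcgm

# Start the training workload in the background
./train.py --config config.yaml &

# Sample profiling metrics on GPU 0 once per second for 60 seconds
# 1002 = SM active, 1003 = SM occupancy, 1004 = tensor pipe active, 1005 = DRAM active
dcgmi dmon -i 0 -e 1002,1003,1004,1005 -d 1000 -c 60 > profile_output.txt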

Key metrics include SM occupancy, memory bandwidth utilization, and tensor core efficiency.

Latency Distribution Analysis

Collect end-to-end latency histograms:

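# Collect latency samples from inference service
curl -s http://inference-service:8080/metrics | grep inference_request_seconds

# Calculate percentiles from the cumulative histogram buckets
python3 <<'EOF'
import re
import urllib.request

url = 'http://inference-service:8080/metrics'
text = urllib.request.urlopen(url).read().decode()
# Bucket lines look like: inference_request_seconds_bucket{le="0.5"} 123
# Assumes one label set; sum counts per "le" first if there are several.
pat = r'inference_request_seconds_bucket\{[^}]*le="([^"]+)"[^}]*\}\s+(\S+)'
buckets = sorted((float(le), float(n)) for le, n in re.findall(pat, text))
total = buckets[-1][1]
for q in (0.50, 0.90, 0.99, 0.999):
    # Upper bound of the first bucket whose cumulative count reaches quantile q
    bound = next(le for le, n in buckets if n >= q * total)
    print(f'P{q * 100:g}: <= {bound}s')
EOF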

Tail latency (P99+) matters for user-facing inference where occasional slow responses degrade perceived quality.
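
Because a histogram only records cumulative counts per bucket, percentiles have to be estimated. One common approach, similar in spirit to PromQL's histogram_quantile, is to interpolate linearly inside the bucket that contains the target rank. The sketch below is one way to do that; the bucket boundaries and counts are hypothetical values chosen for illustration.

```python
import math

def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # cannot interpolate into the +Inf bucket
            if count == prev_count:
                return bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound

# Hypothetical cumulative bucket counts (seconds, requests), for illustration only.
example = [(0.25, 10), (0.5, 60), (1.0, 95), (2.0, 100), (math.inf, 100)]
p50 = histogram_quantile(0.50, example)
p99 = histogram_quantile(0.99, example)
print(f"P50 ~ {p50:.2f}s, P99 ~ {p99:.2f}s")
```

The estimate is only as fine as the bucket layout: if the P99 falls in a wide bucket, adding boundaries near the SLA threshold gives a more useful number.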

Throughput vs Latency Tradeoff

Batching increases throughput but also raises per-request latency:

Batch Size   Requests/Second   Avg Latency   P99 Latency
1            2.3               435 ms        520 ms
4            8.1               493 ms        680 ms
16           18.4              871 ms        1200 ms
64           29.2              2193 ms       3100 ms
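
The rows above are consistent with a simple model in which throughput is roughly batch size divided by average latency. The sketch below checks that relation and prints how much each 4x step in batch size multiplies throughput versus latency. The back-to-back batching model is an assumption made here for illustration, not a statement about how the figures were measured.

```python
# (batch size, requests/s, average latency in ms) copied from the table above
ROWS = [(1, 2.3, 435), (4, 8.1, 493), (16, 18.4, 871), (64, 29.2, 2193)]

def implied_rps(batch: int, avg_latency_ms: float) -> float:
    """Throughput if each batch takes avg_latency_ms and batches run back to back."""
    return batch / (avg_latency_ms / 1000.0)

for batch, rps, latency in ROWS:
    implied = implied_rps(batch, latency)
    print(f"batch {batch:>2}: reported {rps:5.1f} req/s, implied {implied:5.1f} req/s")

# For each 4x step in batch size: throughput multiplier vs latency multiplier
steps = []
for (b0, r0, l0), (b1, r1, l1) in zip(ROWS, ROWS[1:]):
    steps.append((b1, r1 / r0, l1 / l0))
    print(f"{b0} -> {b1}: throughput x{r1 / r0:.2f}, latency x{l1 / l0:.2f}")
```

The throughput multiplier shrinks at every step while the latency multiplier grows, which is the diminishing-returns pattern in numeric form.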

Tune batch size based on workload SLAs: interactive workloads prefer low latency, batch processing prioritizes throughput.
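
One way to turn an SLA into a batch size is to pick the highest-throughput configuration whose P99 latency still fits the budget. A minimal sketch using the table's figures follows; the latency budgets are hypothetical.

```python
# (batch size, requests/s, P99 latency in ms) copied from the table above
ROWS = [(1, 2.3, 520), (4, 8.1, 680), (16, 18.4, 1200), (64, 29.2, 3100)]

def pick_batch_size(rows, p99_budget_ms):
    """Return the highest-throughput row whose P99 latency fits the budget, or None."""
    eligible = [row for row in rows if row[2] <= p99_budget_ms]
    return max(eligible, key=lambda row: row[1]) if eligible else None

# Hypothetical budgets: 700 ms for interactive use, 5000 ms for offline batch jobs,
# and 300 ms to show the case where no configuration qualifies.
for budget in (700, 5000, 300):
    print(f"P99 budget {budget} ms -> {pick_batch_size(ROWS, budget)}")
```

A result of None means no measured configuration meets the budget, which points to a hardware or model change rather than batch tuning.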

EXERCISE

Run llama-bench with the same model at batch sizes 1, 4, and 16, record the throughput and latency results, then calculate the cost-per-token for each configuration at your cluster's electricity rate.
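
For the cost calculation, electricity cost per token follows from throughput, power draw, and tariff. The sketch below assumes steady-state generation and a constant wall-power figure; all input values are hypothetical placeholders for your own measurements.

```python
def cost_per_million_tokens(tokens_per_s: float, power_watts: float,
                            price_per_kwh: float) -> float:
    """Electricity cost of generating one million tokens at steady state."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    seconds = 1_000_000 / tokens_per_s
    kwh = power_watts / 1000.0 * seconds / 3600.0
    return kwh * price_per_kwh

# Hypothetical inputs: replace with your measured throughput, wall power, and tariff.
example = cost_per_million_tokens(tokens_per_s=50.0, power_watts=360.0,
                                  price_per_kwh=0.15)
print(f"${example:.2f} per million tokens")
```

Divide the result by 1,000,000 for a per-token figure. Power draw usually changes with batch size, so measure it separately for each configuration.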
