19. Hardware Benchmarking
Chapter 19 of 20 · 20 min
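Standardized benchmarks make performance comparisons across hardware configurations meaningful. Holding the model, quantization, prompt, and software build constant ensures that measured differences come from the hardware rather than from the test setup.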
Benchmark Methodology
Essential benchmarks for AI hardware:
- Token generation throughput (tokens/second)
- Context encoding speed (ms/token)
- Time to first token (latency)
- Batch processing throughput (aggregate tokens/second at each batch size)
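As a rough illustration of how the first three metrics relate, the sketch below derives them from three wall-clock timestamps and two token counts. The function name `summarize_run` and its parameters are hypothetical, and it assumes the whole time to first token is spent encoding the prompt, which is only an approximation.

```python
def summarize_run(t_start, t_first_token, t_end, n_prompt_tokens, n_generated_tokens):
    """Derive headline metrics from three timestamps (in seconds) and token counts."""
    ttft = t_first_token - t_start        # time to first token
    decode_time = t_end - t_first_token   # time spent on the remaining tokens
    return {
        "time_to_first_token_s": ttft,
        # Assumption: all of the time to first token is prompt encoding.
        "context_encoding_ms_per_token": 1000.0 * ttft / n_prompt_tokens,
        # The first token arrives at t_first_token, so n - 1 tokens follow it.
        "generation_tokens_per_s": (n_generated_tokens - 1) / decode_time,
    }


# Example: 250 prompt tokens, first token after 0.5 s, 201 tokens done at 10.5 s
print(summarize_run(0.0, 0.5, 10.5, 250, 201))
# {'time_to_first_token_s': 0.5, 'context_encoding_ms_per_token': 2.0, 'generation_tokens_per_s': 20.0}
```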
Maintain consistent test conditions:
# Standard test parameters
MODEL="llama-3-8b-instruct-q4_k_m.gguf"
CONTEXT=4096
MAX_TOKENS=512
WARMUP_PASSES=3
TEST_PASSES=5
PROMPT="Write a Python function to calculate fibonacci numbers:"
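The warm-up and test pass counts above suggest a simple measurement loop: discard a few untimed runs, then time the rest and report the spread. The following is a minimal sketch of that idea; `timed_passes` and `run_once` are hypothetical names, and `run_once` stands in for whatever call performs a single inference pass.

```python
import statistics
import time


def timed_passes(run_once, warmup_passes=3, test_passes=5):
    """Run untimed warm-up passes, then time the test passes."""
    for _ in range(warmup_passes):
        run_once()  # results discarded: lets caches, clocks, and drivers settle

    samples = []
    for _ in range(test_passes):
        start = time.perf_counter()
        run_once()
        samples.append(time.perf_counter() - start)

    return {
        "samples": samples,
        "mean_s": statistics.mean(samples),
        # stdev needs at least two samples
        "stdev_s": statistics.stdev(samples) if len(samples) > 1 else 0.0,
    }


# Usage with a stand-in workload instead of a real inference call
stats = timed_passes(lambda: sum(range(100_000)))
print(f"mean {stats['mean_s']:.6f} s, stdev {stats['stdev_s']:.6f} s")
```

Reporting the standard deviation alongside the mean makes it easier to tell a real hardware difference from run-to-run noise.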
llama.cpp Benchmark Tool
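# Install and run llama-bench
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# LLAMA_CUBLAS=1 applies to older checkouts. Newer checkouts build with CMake:
#   cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release
# (the binary is then build/bin/llama-bench)
make LLAMA_CUBLAS=1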
# Run standardized benchmark
# -p sets the prompt length, -n the number of generated tokens, -r the repetitions
./llama-bench \
  -m models/llama-3-8b-q4_k_m.gguf \
  -ngl 999 \
  -t 16 \
  -p 512 \
  -n 512 \
  -r 5
# Example output (abridged; log lines, columns, and figures vary by build and hardware).
# pp512 = prompt processing of 512 tokens, tg512 = generation of 512 tokens.
# main: build: 2811 (b3298)
# main: performing importance sampling...
# ggml_cuda_init: initialized successfully
# model_load_internal: finished loading
# llama.cpp build: 2811
#
# | model      | params | test  |    t/s |
# |-----------:|-------:|------:|-------:|
# | llama-3-8b | 8.05 B | pp512 | 495.62 |
# | llama-3-8b | 8.05 B | tg512 |  23.49 |
# | llama-3-8b | 8.05 B | total |  25.17 |
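If you capture the table as text, a small parser can turn it into records for later comparison. This is a sketch that assumes a pipe-delimited table shaped like the one above; `parse_bench_table` is a hypothetical helper, and the column names come from whatever header row the tool prints.

```python
def parse_bench_table(text):
    """Parse a pipe-delimited results table into a list of dicts keyed by header."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip().lstrip("#").strip()  # tolerate a leading comment marker
        if not line.startswith("|"):
            continue  # skip log lines and blanks
        cells = [c.strip() for c in line.strip("|").split("|")]
        if all(c and set(c) <= set("-:") for c in cells):
            continue  # skip the |---:| separator row
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


sample = """
| model | params | test | t/s |
|---------:|----------:|--------:|------:|
| llama-3-8b | 8.05 B | pp512 | 495.62 |
| llama-3-8b | 8.05 B | tg512 | 23.49 |
"""
for row in parse_bench_table(sample):
    print(row["test"], float(row["t/s"]))
# pp512 495.62
# tg512 23.49
```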
Data Collection Template
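import json
import subprocess
import time


def benchmark_model(model_path, ngl, threads, n_prompt=512, n_gen=512):
    """Run llama-bench and extract performance metrics."""
    start = time.perf_counter()
    result = subprocess.run(
        ["./llama-bench", "-m", model_path, "-ngl", str(ngl),
         "-t", str(threads), "-p", str(n_prompt), "-n", str(n_gen),
         "-o", "json"],
        capture_output=True, text=True, check=True
    )
    duration = time.perf_counter() - start

    # Parse tokens/s from the JSON output. Assumption: stdout is a JSON list
    # with one entry per test, each carrying n_prompt, n_gen, and avg_ts
    # (average tokens/second). Verify the field names against your build.
    rows = json.loads(result.stdout)
    pp_tok_sec = next(r["avg_ts"] for r in rows if r["n_gen"] == 0)
    tokens_sec = next(r["avg_ts"] for r in rows if r["n_prompt"] == 0)

    return {
        "tokens_per_second": tokens_sec,
        "context_encoding": pp_tok_sec,
        "test_duration": duration
    }


# Run across configurations
configs = [
    ("8B-Q4", "models/llama-3-8b-q4_k_m.gguf", 999, 16),
    ("8B-Q8", "models/llama-3-8b-q8_0.gguf", 999, 16),
]

results = []
for name, model, ngl, threads in configs:
    result = benchmark_model(model, ngl, threads)
    results.append((name, result))
    print(f"{name}: {result['tokens_per_second']:.1f} tok/s")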
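To keep results comparable across sessions, it helps to write them out in a fixed column order. The sketch below serializes a list of `(name, metrics)` tuples, shaped like the `results` list in the template, to CSV text. `results_to_csv` is a hypothetical helper and the sample values are placeholders, not measurements.

```python
import csv
import io


def results_to_csv(results):
    """Serialize (config_name, metrics_dict) tuples to CSV text with a fixed header."""
    fieldnames = ["config", "tokens_per_second", "context_encoding", "test_duration"]
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=fieldnames, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for name, metrics in results:
        writer.writerow({"config": name, **metrics})
    return buf.getvalue()


# Placeholder values for illustration only
example = [
    ("config-a", {"tokens_per_second": 20.0, "context_encoding": 400.0, "test_duration": 30.0}),
]
print(results_to_csv(example), end="")
# config,tokens_per_second,context_encoding,test_duration
# config-a,20.0,400.0,30.0
```

Writing the text to a file with the GPU model and driver version in the filename is one way to keep runs from different machines apart.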
Comparative Benchmarks
Test multiple GPUs on identical software:
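#!/bin/bash
# benchmark_gpu_comparison.sh
# Requires bash 4+ for the ${model,,} lowercase expansion.
GPUS=("0" "1")
MODELS=("7B" "13B")
QUANTS=("q4_k_m" "q8_0")

for gpu in "${GPUS[@]}"; do
  echo "=== Testing GPU $gpu ==="
  for model in "${MODELS[@]}"; do
    for quant in "${QUANTS[@]}"; do
      # Adjust this pattern to match the model files you actually have
      model_file="models/llama-3-${model,,}-${quant}.gguf"
      if [[ ! -f "$model_file" ]]; then
        echo "Skipping $model_file (not found)"
        continue
      fi
      echo "Testing $model $quant..."
      # Restrict the run to a single GPU and keep a per-configuration log
      CUDA_VISIBLE_DEVICES=$gpu ./llama-bench \
        -m "$model_file" \
        -ngl 999 2>&1 | tee "results_${gpu}_${model}_${quant}.log"
    done
  done
done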
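Once each GPU has a throughput figure for the same model and quantization, the comparison reduces to a ratio against a chosen baseline. A minimal sketch follows; `relative_throughput` is a hypothetical helper, and the numbers in the usage example are placeholders rather than measurements.

```python
def relative_throughput(tokens_per_s_by_gpu, baseline):
    """Express each GPU's tokens/second as a multiple of the baseline GPU's."""
    base = tokens_per_s_by_gpu[baseline]
    if base <= 0:
        raise ValueError("baseline throughput must be positive")
    return {gpu: tps / base for gpu, tps in tokens_per_s_by_gpu.items()}


# Placeholder figures: GPU "1" generates 1.5x as many tokens/second as GPU "0"
print(relative_throughput({"0": 20.0, "1": 30.0}, baseline="0"))
# {'0': 1.0, '1': 1.5}
```

Compare only figures taken with the same model file, quantization, and test lengths; a ratio across mismatched configurations says little about the hardware.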
EXERCISE
Run llama-bench on your current hardware with a 7B and 13B model. Log results and compare against published reference benchmarks for your GPU model.