Hardware Benchmarking — Hardware Planning for Local AI (Chapter 19)

Standardized benchmarks enable comparisons across hardware configurations. Using consistent methodology reveals actual performance differences.

Benchmark Methodology

Essential benchmarks for AI hardware:

Token generation throughput (tokens/second)
Context encoding speed (ms/token)
Time to first token (latency)
Batch processing throughput (tokens/second/batch size)

Maintain consistent test conditions:

# Standard test parameters
MODEL="llama-3-8b-instruct-q4_k_m.gguf"
CONTEXT=4096
MAX_TOKENS=512
WARMUP_PASSES=3
TEST_PASSES=5
PROMPT="Write a Python function to calculate fibonacci numbers:"

llama.cpp Benchmark Tool

# Install and run llama-bench
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1

# Run standardized benchmark
./llama-bench \
    -m models/llama-3-8b-q4_k_m.gguf \
    -ngl 999 \
    -t 16 \
    -c 4096 \
    -n 512

# Output includes:
# main: build: 2811 (b3298)
# main: performing importance sampling...
# ggml_cuda_init:  initialized successfully
# model_load_internal: finished loading
# llama.cpp build: 2811
#
# |      model |     params |    test |    t/s |
# |---------:|----------:|--------:|------:|
# |  llama-3-8b |      8.05 B |    pp512 | 495.62 |
# |  llama-3-8b |      8.05 B |    tg512 |  23.49 |
# |  llama-3-8b |      8.05 B |   total |  25.17 |

Data Collection Template

import time
import subprocess
import json

def benchmark_model(model_path, ngl, threads, context=4096):
    """Run benchmark and extract performance metrics."""
    result = subprocess.run(
        ["./llama-bench", "-m", model_path, "-ngl", str(ngl), 
         "-t", str(threads), "-c", str(context)],
        capture_output=True, text=True
    )
    
    # Parse token/s from output
    # ... (parsing logic)
    
    return {
        "tokens_per_second": tokens_sec,
        "context_encoding": pp_tok_sec,
        "test_duration": duration
    }

# Run across configurations
configs = [
    ("7B-Q4", "models/llama-3-8b-q4_k_m.gguf", 999, 16),
    ("7B-Q8", "models/llama-3-8b-q8_0.gguf", 999, 16),
]

results = []
for name, model, ngl, threads in configs:
    result = benchmark_model(model, ngl, threads)
    results.append((name, result))
    print(f"{name}: {result['tokens_per_second']:.1f} tok/s")

Comparative Benchmarks

Test multiple GPUs on identical software:

#!/bin/bash
# benchmark_gpu_comparison.sh

GPUS=("0" "1")
MODELS=("7B" "13B")
QUANTS=("q4_k_m" "q8_0")

for gpu in "${GPUS[@]}"; do
    echo "=== Testing GPU $gpu ==="
    for model in "${MODELS[@]}"; do
        for quant in "${QUANTS[@]}"; do
            echo "Testing $model $quant..."
            CUDA_VISIBLE_DEVICES=$gpu ./llama-bench \
                -m models/llama-3-${model,,}-${quant}.gguf \
                -ngl 999 2>&1 | tee "results_${gpu}_${model}_${quant}.log"
        done
    done
done