19. GGUF Conversion
Chapter 19 of 24 · 20 min
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
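One way to keep that record consistent between runs is to collect it programmatically. The sketch below is only an illustration, not a required tool: the helper name `environment_record`, the model path, and the note text are hypothetical, and it assumes the package under test is installed under the distribution name `llama-cpp-python`.

```python
import json
import platform
import sys
from importlib import metadata


def environment_record(package, data_path, observed_output, notes=""):
    """Collect the details the checkpoint asks for into one dict."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # Record the gap instead of failing, so the log is still written.
        version = "not installed"
    return {
        "package": package,
        "package_version": version,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "data_path": str(data_path),
        "observed_output": observed_output,
        "notes": notes,  # e.g. model size, backend, memory limits
    }


record = environment_record(
    package="llama-cpp-python",            # assumed distribution name
    data_path="models/example.gguf",       # hypothetical path
    observed_output="(paste output here)",
    notes="CPU backend, 16 GB RAM",        # hypothetical constraint
)
print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the exercise keeps any model-size or backend constraint attached to the result it affects.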
EXERCISE: Benchmark Quantization Impact
Compare inference latency and output length across quantization levels of the same model:
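import time
from pathlib import Path

from llama_cpp import Llama


def benchmark_quantization(model_paths, test_prompt, tokenizer=None):
    """Measure wall-clock latency and generated-token count per quantized model.

    `tokenizer` is unused and kept only so existing calls still work.
    """
    results = {}
    for path in map(Path, model_paths):  # accept both str and Path entries
        model = Llama(model_path=str(path), n_ctx=2048)
        # Time the call directly: the completion dict has no "timings" field.
        start = time.perf_counter()
        output = model(test_prompt, max_tokens=100)
        elapsed = time.perf_counter() - start
        results[path.name] = {
            "latency": elapsed,  # seconds, excludes model load time
            # "usage" holds real token counts; text.split() would count words.
            "tokens": output["usage"]["completion_tokens"],
        }
    return results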
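Raw latencies are easier to compare once they are normalized. The following sketch assumes the dictionary returned above, with `latency` in seconds and a `tokens` count per file. The helper name `summarize`, the file names, and the numbers are made up for illustration and are not measurements.

```python
def summarize(results, baseline):
    """Add tokens/sec and latency relative to a chosen baseline entry."""
    base_latency = results[baseline]["latency"]
    summary = {}
    for name, r in results.items():
        summary[name] = {
            # Guard against a zero latency to avoid ZeroDivisionError.
            "tokens_per_sec": r["tokens"] / r["latency"] if r["latency"] > 0 else 0.0,
            "relative_latency": (
                r["latency"] / base_latency if base_latency > 0 else float("nan")
            ),
        }
    return summary


# Illustrative values only; replace with the output of benchmark_quantization.
example = {
    "model-q8_0.gguf": {"latency": 4.0, "tokens": 100},
    "model-q4_k_m.gguf": {"latency": 2.5, "tokens": 100},
}
for name, row in summarize(example, baseline="model-q8_0.gguf").items():
    print(f"{name}: {row['tokens_per_sec']:.1f} tok/s, "
          f"{row['relative_latency']:.2f}x baseline latency")
```

Speed is only half of the comparison. Token throughput says nothing about output quality, so also compare the generated text itself across quantization levels.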