19. GGUF Conversion

Chapter 19 of 24 · 20 min

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
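One way to capture the details this checkpoint asks for is to write them into a small JSON record next to the exercise output. The sketch below is only an illustration: `record_run`, `package_version`, the default package list, and the example model path are hypothetical names chosen here, not part of the chapter's code.

```python
import json
import platform
import sys
from importlib import metadata
from pathlib import Path


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def record_run(model_path, observed_output, notes="",
               packages=("llama-cpp-python", "numpy")):
    """Collect package versions, runtime, data path, and output in one dict."""
    return {
        "packages": {name: package_version(name) for name in packages},
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "model_path": str(Path(model_path)),
        "observed_output": observed_output,
        # Free-text constraints: model size, backend, available memory, etc.
        "notes": notes,
    }


if __name__ == "__main__":
    # The path and note below are placeholders, not files shipped with the course.
    record = record_run(
        "models/example-q4_k_m.gguf",
        "paste the observed output here",
        notes="CPU backend, 16 GB RAM",
    )
    print(json.dumps(record, indent=2))
```

Saving this record alongside each exercise result makes it easier to tell later whether a difference in output came from the model file or from the environment.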

EXERCISE: Benchmark Quantization Impact

Compare inference results across quantization levels:

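import time
from pathlib import Path

from llama_cpp import Llama

def benchmark_quantization(model_paths, test_prompt, max_tokens=100):
    """Measure generation latency and token counts across quantized GGUF files."""
    results = {}

    for path in map(Path, model_paths):
        model = Llama(model_path=str(path), n_ctx=2048, verbose=False)

        # The completion dict returned by llama-cpp-python has no "timings"
        # field, so time the call directly (wall clock, includes prompt processing).
        start = time.perf_counter()
        output = model(test_prompt, max_tokens=max_tokens)
        latency = time.perf_counter() - start

        # "usage" reports real token counts; splitting the text on whitespace counts words.
        n_tokens = output["usage"]["completion_tokens"]
        results[path.name] = {
            "latency": latency,  # seconds
            "tokens": n_tokens,
            "tokens_per_sec": n_tokens / latency if latency > 0 else 0.0,
            "text": output["choices"][0]["text"],  # kept so outputs can be compared by eye
        }

    return results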
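Raw latencies are easier to compare once they are expressed relative to a baseline file. The following is a minimal sketch, assuming a results dictionary whose entries carry `latency` (seconds) and `tokens` as in the exercise; `summarize`, the file names, and the numbers are hypothetical placeholders for illustration, not measurements.

```python
def summarize(results, baseline):
    """Compute tokens/sec per model and the speed-up relative to a baseline entry.

    Assumes every entry has a positive "latency" and a "tokens" count.
    """
    base = results[baseline]
    base_tps = base["tokens"] / base["latency"]

    summary = {}
    for name, row in results.items():
        tps = row["tokens"] / row["latency"]
        summary[name] = {
            "tokens_per_sec": round(tps, 2),
            "speedup_vs_baseline": round(tps / base_tps, 2),
        }
    return summary


# Placeholder values only: substitute the dict returned by your own benchmark run.
example_results = {
    "model-f16.gguf": {"latency": 10.0, "tokens": 100},
    "model-q4_k_m.gguf": {"latency": 4.0, "tokens": 100},
}

for name, row in summarize(example_results, "model-f16.gguf").items():
    print(f"{name:<20} {row['tokens_per_sec']:>8} tok/s  x{row['speedup_vs_baseline']}")
```

Throughput alone says nothing about output quality, so it is worth reading the generated text for each quantization level side by side before choosing one.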