GGUF Quantization — Model Optimization for Local Inference (Chapter 5)

GGUF (previously GGML) operates on fundamentally different principles than GPTQ/AWQ. Originally designed for CPU inference, it uses a unified memory model where models load entirely into memory and optionally accelerate on GPU. This design enables mmapped loading and streaming inference for models too large to fit in RAM.

The format defines quantization schemes as type identifiers:

Q4_0: 4-bit, no added quantization constants
Q4_1: 4-bit, with scale factors
Q5_0, Q5_1: 5-bit variants
Q8_0: 8-bit, minimal quality loss
Q2_K: 2-bit with 4-bit quantization constants for important values

The _K suffix denotes "keep" variants that use mixed precision for critical values. Q4_K_M (4-bit with medium keep) balances memory and quality effectively for most use cases.

# llama-cpp-python usage
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",
    n_gpu_layers=35,  # Offload layers to GPU
    n_ctx=8192,       # Context window
    n_threads=8,      # CPU threads
    flash_attn=True   # Enable Flash Attention if supported
)

# Streaming completion
stream = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is quantization?"}
    ],
    stream=True
)

for chunk in stream:
    print(chunk["choices"][0]["delta"]["content"], end="", flush=True)

GPU offloading configuration requires knowing your hardware. A 3090 with 24GB VRAM can offload roughly 35 layers of a 7B Q4 model. Calculate required VRAM as: (layer_size_bytes * layers) + (context_size * 2 bytes).

# Estimate GPU layers possible
total_vram_gb = 24
model_size_gb = 4.5  # Q4_K_M 7B model
context_gb = 0.5     # 8K context overhead

available_gb = total_vram_gb - context_gb
layers_possible = int((available_gb / model_size_gb) * 33)  # 33 = total layers in 7B
print(f"Approximately {layers_possible} layers can fit in VRAM")

Memory-mapped loading enables working with models larger than available RAM:

# For 70B models or systems with limited RAM
llm = Llama(
    model_path="./models/llama-70b.Q4_K_M.gguf",
    use_mmap=True,    # Enable memory mapping
    use_mlock=True,   # Lock mapped memory to prevent swapping
    n_gpu_layers=0    # CPU-only if VRAM insufficient
)

Performance tuning:

# Optimal settings for RTX 4090 with Ryzen 7950X
llm = Llama(
    model_path="./models/llama-7b.Q4_K_M.gguf",
    n_gpu_layers=35,
    n_threads=16,        # Physical cores, not logical
    n_threads_batch=16,
    rope_freq_base=1000000,  # Extended context scaling
    offload_k_cached=True,   # Keep K cache in RAM
)