05. GGUF Quantization
GGUF (previously GGML) operates on fundamentally different principles than GPTQ/AWQ. Originally designed for CPU inference, it uses a unified memory model where models load entirely into memory and optionally accelerate on GPU. This design enables mmapped loading and streaming inference for models too large to fit in RAM.
The format defines quantization schemes as type identifiers:
Q4_0: 4-bit, no added quantization constantsQ4_1: 4-bit, with scale factorsQ5_0,Q5_1: 5-bit variantsQ8_0: 8-bit, minimal quality lossQ2_K: 2-bit with 4-bit quantization constants for important values
The _K suffix denotes "keep" variants that use mixed precision for critical values. Q4_K_M (4-bit with medium keep) balances memory and quality effectively for most use cases.
# llama-cpp-python usage
from llama_cpp import Llama
llm = Llama(
model_path="./models/llama-2-7b.Q4_K_M.gguf",
n_gpu_layers=35, # Offload layers to GPU
n_ctx=8192, # Context window
n_threads=8, # CPU threads
flash_attn=True # Enable Flash Attention if supported
)
# Streaming completion
stream = llm.create_chat_completion(
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is quantization?"}
],
stream=True
)
for chunk in stream:
print(chunk["choices"][0]["delta"]["content"], end="", flush=True)
GPU offloading configuration requires knowing your hardware. A 3090 with 24GB VRAM can offload roughly 35 layers of a 7B Q4 model. Calculate required VRAM as: (layer_size_bytes * layers) + (context_size * 2 bytes).
# Estimate GPU layers possible
total_vram_gb = 24
model_size_gb = 4.5 # Q4_K_M 7B model
context_gb = 0.5 # 8K context overhead
available_gb = total_vram_gb - context_gb
layers_possible = int((available_gb / model_size_gb) * 33) # 33 = total layers in 7B
print(f"Approximately {layers_possible} layers can fit in VRAM")
Memory-mapped loading enables working with models larger than available RAM:
# For 70B models or systems with limited RAM
llm = Llama(
model_path="./models/llama-70b.Q4_K_M.gguf",
use_mmap=True, # Enable memory mapping
use_mlock=True, # Lock mapped memory to prevent swapping
n_gpu_layers=0 # CPU-only if VRAM insufficient
)
Performance tuning:
# Optimal settings for RTX 4090 with Ryzen 7950X
llm = Llama(
model_path="./models/llama-7b.Q4_K_M.gguf",
n_gpu_layers=35,
n_threads=16, # Physical cores, not logical
n_threads_batch=16,
rope_freq_base=1000000, # Extended context scaling
offload_k_cached=True, # Keep K cache in RAM
)
Configure GGUF for a system with limited VRAM (8GB). Test partial GPU offloading. Measure latency difference between CPU-only, partial GPU, and full GPU modes.