02. Quantization Formats Compared

Chapter 2 of 18 · 15 min

Three quantization ecosystems dominate the local inference landscape: GGUF, GPTQ, and AWQ. Each represents different tradeoffs between compression ratio, accuracy preservation, and hardware compatibility.

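GGUF, the successor to the GGML file format, originated in the llama.cpp ecosystem. It excels at CPU inference with optional GPU offload. Models are quantized to a range of bit depths identified by type name (q4_0, q5_1, q8_0, and K-quants such as q4_K_M). The format supports memory-mapped (mmap) loading, so the operating system pages weights in from disk on demand. A model larger than available RAM can still load this way, though inference slows sharply once pages have to be re-read from disk. Python bindings through llama-cpp-python provide accessible integration.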

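from llama_cpp import Llama

# Load a 4-bit GGUF model; the path is a placeholder for a local file
llm = Llama(
    model_path="./models/llama-7b.q4_0.gguf",
    n_gpu_layers=35,  # layers to offload to GPU (0 = CPU only, -1 = all)
    n_ctx=4096,       # context window in tokens
)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization"}]
)
# The reply follows the OpenAI-style chat completion schema
print(response["choices"][0]["message"]["content"])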
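
To make type names like q4_0 more concrete, here is a simplified sketch of block-wise 4-bit quantization in plain Python. It assumes blocks of 32 weights that share one 16-bit scale, which is how q4_0 is usually described. The rounding rule and the function names are illustrative assumptions, not llama.cpp's exact bit layout.

```python
import random

BLOCK_SIZE = 32   # weights per block (assumed, following the usual q4_0 description)
SCALE_BITS = 16   # one half-precision scale stored per block (assumed)


def quantize_block(block):
    """Map floats to signed 4-bit codes in [-8, 7] plus one shared scale."""
    max_abs = max(abs(w) for w in block)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in block]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate floats from the codes and the shared scale."""
    return [scale * c for c in codes]


def bits_per_weight(bits=4):
    """Effective storage cost once the per-block scale is included."""
    return (BLOCK_SIZE * bits + SCALE_BITS) / BLOCK_SIZE


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print(f"bits per weight: {bits_per_weight()}")
print(f"max abs error: {max_error:.6f} (scale {scale:.6f})")
```

Under these assumptions a "4-bit" type costs 4.5 bits per weight, because every block also carries its scale. The rounding error of any single weight is bounded by half the block's scale.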

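GPTQ emerged from academic research on post-training quantization. It quantizes weights one layer at a time, using approximate second-order information from a small calibration set to adjust the not-yet-quantized weights and compensate for rounding error already introduced. This gives better accuracy per bit than naive round-to-nearest. Weights are typically stored in groups (for example, 128 weights) that share quantization parameters. The auto-gptq library handles conversion. Fast inference depends on dedicated GPU kernels, primarily CUDA kernels for NVIDIA GPUs.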
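
The effect of grouping can be sketched without any GPU. The example below is a plain round-to-nearest baseline with one scale and one minimum per group. It is not GPTQ's error-compensating procedure, and the function names and the synthetic weight row are assumptions made for illustration.

```python
import random


def quantize_groupwise(weights, bits=4, group_size=128):
    """Round-to-nearest with one (scale, minimum) pair per group of weights."""
    levels = 2 ** bits - 1
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = [round((w - lo) / scale) for w in group]
        groups.append((scale, lo, codes))
    return groups


def dequantize_groupwise(groups):
    """Rebuild approximate weights from every group's codes."""
    return [lo + scale * c for scale, lo, codes in groups for c in codes]


def mean_abs_error(weights, bits, group_size):
    """Average reconstruction error for a given group size."""
    restored = dequantize_groupwise(quantize_groupwise(weights, bits, group_size))
    return sum(abs(w - r) for w, r in zip(weights, restored)) / len(weights)


random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(1024)]
row[10] = 0.5  # a single outlier stretches the range of whichever group holds it

for group_size in (1024, 128, 32):
    print(group_size, f"{mean_abs_error(row, 4, group_size):.6f}")
```

With one group spanning the whole row, the outlier widens the quantization step for every weight. With groups of 128, only the group containing the outlier pays that cost, at the price of storing more scales.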

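# Install auto-gptq
pip install auto-gptq

# Quantize model
python -c "
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

name = 'meta-llama/Llama-2-7b-hf'
tokenizer = AutoTokenizer.from_pretrained(name)
# 4-bit weights; each group of 128 weights shares quantization parameters
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(name, quantize_config)
# Calibration dataset: one toy sample here; use many representative samples in practice
examples = [tokenizer('Quantization maps weights to a small set of integer levels.')]
model.quantize(examples)
# Output directory is a placeholder
model.save_quantized('./models/llama-2-7b-gptq-4bit')
"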

AWQ (Activation-aware Weight Quantization) takes a different approach. Rather than treating all weights equally, it uses activation statistics from calibration data to find the small fraction of weight channels that matter most to a layer's output, then rescales those channels before quantization so they lose less precision. This method preserves more model capability at aggressive bit depths.
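
The idea can be sketched on a toy linear layer. The example below is a rough illustration under stated assumptions, not the AWQ reference implementation: it uses per-row symmetric 4-bit rounding, sets each input channel's scale to its mean absolute activation raised to a power alpha, and searches a small grid of alpha values. All names, sizes, and the synthetic data are hypothetical.

```python
import random


def quantize_row(row, bits=4):
    """Symmetric round-to-nearest, one scale per output row; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in row)
    step = max_abs / qmax if max_abs > 0 else 1.0
    return [step * max(-qmax - 1, min(qmax, round(w / step))) for w in row]


def scaled_quantize(weights, channel_scales):
    """Scale input channels up, quantize, then fold the inverse scale back in.

    Dividing the quantized weights by the scale is mathematically the same
    as dividing the activations by it at inference time.
    """
    result = []
    for row in weights:
        scaled = [w * s for w, s in zip(row, channel_scales)]
        result.append([q / s for q, s in zip(quantize_row(scaled), channel_scales)])
    return result


def output_error(weights, effective, inputs):
    """Mean squared difference between x @ W.T and x @ W_effective.T."""
    total = 0.0
    for x in inputs:
        for w_row, e_row in zip(weights, effective):
            diff = sum(xi * (wi - ei) for xi, wi, ei in zip(x, w_row, e_row))
            total += diff * diff
    return total / (len(inputs) * len(weights))


random.seed(0)
in_features, out_features, samples = 16, 8, 64
weights = [[random.gauss(0, 1) for _ in range(in_features)] for _ in range(out_features)]
inputs = [[random.gauss(0, 1) for _ in range(in_features)] for _ in range(samples)]
for x in inputs:
    x[0] *= 20.0  # channel 0 carries unusually large activations

mean_abs = [sum(abs(x[j]) for x in inputs) / samples for j in range(in_features)]

# alpha = 0 means every scale is 1, i.e. plain round-to-nearest
baseline = output_error(weights, scaled_quantize(weights, [1.0] * in_features), inputs)
best_alpha, best_error = 0.0, baseline
for step in range(1, 11):
    alpha = step / 10
    scales = [m ** alpha for m in mean_abs]
    error = output_error(weights, scaled_quantize(weights, scales), inputs)
    if error < best_error:
        best_alpha, best_error = alpha, error

print(f"plain round-to-nearest error: {baseline:.4f}")
print(f"best alpha {best_alpha:.1f} error: {best_error:.4f}")
```

Because alpha = 0 reproduces plain rounding, the search can never do worse than the baseline on the calibration inputs. How much it helps depends on how uneven the activation magnitudes are.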

Format comparison:

Format  Best Use Case              Hardware            Memory Savings  Accuracy
------  -------------------------  ------------------  --------------  ---------
GGUF    CPU-first, mixed hardware  CPU + NVIDIA        4-8x            Good
GPTQ    GPU-optimized serving      NVIDIA Tensor Core  4x              Very Good
AWQ     Edge deployment            NVIDIA (any)        4-8x            Excellent
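
Memory savings depend on the baseline precision and on per-block overhead. The arithmetic below is a back-of-the-envelope sketch: it assumes a nominal 7 billion parameters and roughly half a bit per weight of scale overhead (one 16-bit scale per 32 weights). It counts weights only and ignores the KV cache and activations.

```python
def model_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9  # nominal 7B parameter count (assumption)
fp16 = model_size_gb(n_params, 16)

for label, bpw in [
    ("FP16", 16),
    ("8-bit + scales (8.5 bpw)", 8.5),
    ("4-bit + scales (4.5 bpw)", 4.5),
]:
    size = model_size_gb(n_params, bpw)
    print(f"{label}: {size:.2f} GB, {fp16 / size:.2f}x smaller than FP16")
```

Under these assumptions a 4-bit model is a little under 4x smaller than FP16. A ratio near 8x implies an FP32 baseline or fewer than 4 bits per weight.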

Hardware constraints often make this choice for you. Of the three, GGUF is the practical option for Apple Silicon or CPU-only deployments. GPTQ and AWQ rely on GPU kernels, so they pay off mainly on NVIDIA GPUs with enough VRAM to hold the quantized model.

EXERCISE

Quantize the same 7B model in GGUF (q4_K_M) and GPTQ (W4A16). Benchmark perplexity on a standard evaluation set. Compare inference speed using identical prompts.
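
For the measurements, a minimal sketch of two helpers may be useful. Both are assumptions for illustration: `perplexity` expects natural-log token probabilities that you collect from whichever runtime you use, and `tokens_per_second` expects a hypothetical `generate` callable that returns the number of tokens it produced.

```python
import math
import time


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood (natural log) over evaluated tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def tokens_per_second(generate, prompt, n_runs=3):
    """Time a callable that returns how many tokens it generated for `prompt`."""
    total_tokens, total_seconds = 0, 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        total_tokens += generate(prompt)
        total_seconds += time.perf_counter() - start
    return total_tokens / total_seconds


# Toy check: a model that assigns probability 0.25 to every token has perplexity 4
uniform = [math.log(0.25)] * 100
print(round(perplexity(uniform), 6))
```

Use the same evaluation text, context length, and prompts for both formats so the numbers are comparable.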