Quantization Formats Compared — Model Optimization for Local Inference (Chapter 2)

Three quantization ecosystems dominate the local inference landscape: GGUF, GPTQ, and AWQ. Each represents different tradeoffs between compression ratio, accuracy preservation, and hardware compatibility.

GGUF, the successor to the GGML file format, originated in the llama.cpp ecosystem. It excels at CPU inference with optional GPU acceleration. Quantization types such as q4_0, q5_1, and q8_0 store weights at roughly 4, 5, and 8 bits, in small blocks that share a scale factor. The format supports memory-mapped (mmap) loading, so the operating system pages weights in from disk on demand; this can let a model larger than available RAM run, though at a much lower speed. Python bindings are available through llama-cpp-python.

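from llama_cpp import Llama

# Load a 4-bit GGUF model; the path is an example and must point to a local file.
llm = Llama(
    model_path="./models/llama-7b.q4_0.gguf",
    n_gpu_layers=35,  # number of layers to offload to the GPU (0 = CPU only)
    n_ctx=4096,       # context window in tokens
)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization"}]
)
# The result follows the OpenAI chat-completion schema.
print(response["choices"][0]["message"]["content"])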
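
Because each block of weights also stores its own scale, the effective bit depth of a GGUF quantization type is slightly higher than its name suggests. The following is a minimal sketch of that arithmetic. The per-block byte counts are the layouts commonly documented for llama.cpp's q4_0, q5_1, and q8_0 types and should be treated as assumptions to verify against your llama.cpp version; the helper names are hypothetical.

```python
# Assumed block layouts (32 weights per block); verify against your llama.cpp version.
BLOCK_SIZE = 32  # weights per block

BLOCK_BYTES = {
    "q4_0": 2 + 16,          # fp16 scale + 32 x 4-bit values
    "q5_1": 2 + 2 + 4 + 16,  # fp16 scale + fp16 minimum + 32 high bits + 32 x 4-bit low bits
    "q8_0": 2 + 32,          # fp16 scale + 32 x 8-bit values
}


def bits_per_weight(quant_type: str) -> float:
    """Effective storage cost per weight, including the per-block metadata."""
    return BLOCK_BYTES[quant_type] * 8 / BLOCK_SIZE


def estimated_size_gb(n_params: float, quant_type: str) -> float:
    """Rough weight-only size in GB (10^9 bytes); ignores file metadata."""
    return n_params * bits_per_weight(quant_type) / 8 / 1e9


for name in BLOCK_BYTES:
    print(f"{name}: {bits_per_weight(name):.2f} bits/weight, "
          f"~{estimated_size_gb(7e9, name):.2f} GB for 7e9 parameters")
```

The estimate covers weights only; the context window (n_ctx) adds KV-cache memory on top of it.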

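GPTQ emerged from academic research on post-training quantization. It achieves better accuracy per bit than naive round-to-nearest quantization by using approximate second-order information, gathered from a small calibration set, to compensate for the error each quantized weight introduces. Weights are typically quantized in groups (for example, 128 per group) that share quantization parameters. The auto-gptq library provides conversion tooling. Fast inference depends on dedicated CUDA kernels, so in practice GPTQ targets NVIDIA GPUs.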
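
To see why quantizing in groups helps, consider the sketch below. It uses plain round-to-nearest quantization, not GPTQ's error-compensation step, and only illustrates the effect of the group size: a single outlier forces a coarse scale on every weight that shares its group, so smaller groups contain the damage. The function names and the toy weights are made up for illustration.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric round-to-nearest quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(v) for v in values)
    scale = absmax / qmax if absmax > 0 else 1.0
    return [round(v / scale) * scale for v in values]


def grouped_quantize(values, group_size, bits=4):
    """Quantize each consecutive group of weights with its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_dequantize(values[start:start + group_size], bits))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy row of weights with one large outlier (4.0) in the second half.
weights = [0.02, -0.05, 0.08, -0.01, 0.03, 4.0, -0.06, 0.07]

single_err = mse(weights, quantize_dequantize(weights))
grouped_err = mse(weights, grouped_quantize(weights, group_size=4))
print(f"one scale for the row: MSE={single_err:.6f}")
print(f"group_size=4:          MSE={grouped_err:.6f}")
```

Smaller groups lower the error but store more scales, which is the tradeoff behind a setting such as group_size=128.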

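# Install auto-gptq
pip install auto-gptq

# Quantize model
python -c "
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = 'meta-llama/Llama-2-7b-hf'
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
# from_pretrained needs the quantization config up front
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Calibration dataset: use many representative samples in practice, not one
tokenizer = AutoTokenizer.from_pretrained(model_id)
examples = [tokenizer('Quantization reduces the precision of model weights.')]
model.quantize(examples)
model.save_quantized('./models/llama-2-7b-gptq-4bit')  # example output path
"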

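AWQ (Activation-aware Weight Quantization) takes a different approach. Rather than treating all weights equally, it uses activation statistics from calibration data to find the small fraction of weight channels that matter most to the output, then scales those channels before quantization so that they lose less precision. The inverse scale is applied on the activation side, so the layer's output is mathematically unchanged before rounding. This tends to preserve more model capability at aggressive bit depths such as 4-bit.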
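
The scaling idea can be sketched on a toy linear layer. This is a simplified illustration under stated assumptions, not the AWQ library's implementation: it uses one quantization scale per output row, random toy data, and a small grid of exponents (ALPHAS) to choose how strongly the per-channel activation magnitude drives the scaling. All names here are hypothetical.

```python
import random


def quantize_row(row, bits=4):
    """Symmetric round-to-nearest quantization with one scale for the whole row."""
    qmax = 2 ** (bits - 1) - 1
    absmax = max(abs(w) for w in row)
    scale = absmax / qmax if absmax > 0 else 1.0
    return [round(w / scale) * scale for w in row]


def output_mse(weights, inputs, channel_scales):
    """Mean squared output error after scaling weights, quantizing, and unscaling inputs."""
    total = 0.0
    for row in weights:
        q_row = quantize_row([w * s for w, s in zip(row, channel_scales)])
        for x in inputs:
            exact = sum(w * xj for w, xj in zip(row, x))
            approx = sum(q * xj / s for q, xj, s in zip(q_row, x, channel_scales))
            total += (exact - approx) ** 2
    return total / (len(weights) * len(inputs))


random.seed(0)
n_in, n_out, n_samples = 8, 4, 16
# Toy data: input channel 0 carries much larger activations than the rest.
channel_gain = [10.0] + [1.0] * (n_in - 1)
inputs = [[random.gauss(0, 1) * g for g in channel_gain] for _ in range(n_samples)]
weights = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]

# Per-channel mean activation magnitude, used to decide which channels to protect.
act_mag = [sum(abs(x[j]) for x in inputs) / n_samples for j in range(n_in)]

ALPHAS = [0.0, 0.25, 0.5, 0.75, 1.0]  # alpha = 0 means no scaling (plain rounding)
errors = {a: output_mse(weights, inputs, [m ** a for m in act_mag]) for a in ALPHAS}
baseline_err = errors[0.0]
best_alpha = min(errors, key=errors.get)
best_err = errors[best_alpha]
print(f"baseline MSE={baseline_err:.5f}, best alpha={best_alpha}, MSE={best_err:.5f}")
```

Including alpha = 0 in the grid means the search can never do worse than plain rounding on the calibration data, which is why a search of this kind is a safe default.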

Format comparison:

Format	Best Use Case	Hardware	Memory Savings	Accuracy
GGUF	CPU-first, mixed hardware	CPU, Apple Silicon, NVIDIA GPU	4-8x	Good
GPTQ	GPU-optimized serving	NVIDIA Tensor Core	4x	Very Good
AWQ	Edge deployment	NVIDIA (any)	4-8x	Excellent

Hardware constraints often make this choice for you. Of the three formats, GGUF is usually the practical option for Apple Silicon or CPU-only deployments. GPTQ and AWQ are built around GPU kernels, so they pay off mainly on NVIDIA GPUs with enough VRAM to hold the quantized model.
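
As a rough illustration, the guidance above can be written as a small decision helper. This is a sketch only: the function name, its parameters, and the 1 GB overhead allowance for the KV cache and activations are assumptions, not measured values.

```python
def suggest_format(has_nvidia_gpu: bool, vram_gb: float,
                   quantized_size_gb: float, overhead_gb: float = 1.0) -> str:
    """Suggest a quantization format from coarse hardware facts.

    overhead_gb is an assumed allowance for KV cache and activations.
    """
    if not has_nvidia_gpu:
        # CPU-only or Apple Silicon: GGUF is the practical choice of the three.
        return "GGUF"
    if quantized_size_gb + overhead_gb > vram_gb:
        # Model does not fit in VRAM: GGUF can offload only some layers to the GPU.
        return "GGUF"
    return "GPTQ or AWQ"


print(suggest_format(has_nvidia_gpu=False, vram_gb=0, quantized_size_gb=4))
print(suggest_format(has_nvidia_gpu=True, vram_gb=24, quantized_size_gb=4))
```

Choosing between GPTQ and AWQ once both fit is better settled by measuring accuracy and throughput on your own workload.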