02. Quantization Formats Compared
Three quantization ecosystems dominate the local inference landscape: GGUF, GPTQ, and AWQ. Each represents different tradeoffs between compression ratio, accuracy preservation, and hardware compatibility.
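GGUF (the successor to GGML) originated in the llama.cpp ecosystem. It excels at CPU inference with optional GPU acceleration. Models quantize to a range of bit depths, each named by its scheme (q4_0, q5_1, and q8_0 use roughly 4, 5, and 8 bits per weight). The format supports memory-mapped (mmap) loading, so the operating system pages weights in from disk on demand. This lets a model larger than available RAM load at all, though throughput drops sharply once pages have to be re-read from disk. Python bindings through llama-cpp-python provide accessible integration.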
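from llama_cpp import Llama

# Load a 4-bit GGUF model
llm = Llama(
    model_path="./models/llama-7b.q4_0.gguf",
    n_gpu_layers=35,  # number of layers to offload to the GPU (0 keeps everything on the CPU)
    n_ctx=4096,       # context window in tokens
)
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain quantization"}]
)
# The result follows the OpenAI chat completion layout
print(response["choices"][0]["message"]["content"])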
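To get a feel for what these type names mean for memory, the sketch below estimates weight storage from bits per weight. It assumes the classic llama.cpp block layouts (32 weights per block, one 16-bit scale per block, plus a 16-bit minimum for q5_1) and ignores file metadata and any tensors kept at higher precision, so treat the numbers as rough estimates. The function names are illustrative and not part of llama-cpp-python.

```python
# Rough weight-storage estimate for a few GGUF quantization types.
# Assumption: 32-weight blocks; per-block overhead is one fp16 scale,
# plus one fp16 minimum for the "_1" variant.
BLOCK_SIZE = 32

# quant type -> (bits per quantized weight, per-block overhead in bits)
QUANT_LAYOUTS = {
    "q4_0": (4, 16),
    "q5_1": (5, 32),
    "q8_0": (8, 16),
}


def bits_per_weight(quant_type: str) -> float:
    """Effective bits per weight, including the per-block scale/minimum."""
    weight_bits, overhead_bits = QUANT_LAYOUTS[quant_type]
    return weight_bits + overhead_bits / BLOCK_SIZE


def estimated_gib(n_params: float, quant_type: str) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight(quant_type) / 8 / 2**30


if __name__ == "__main__":
    for name in QUANT_LAYOUTS:
        print(f"{name}: {bits_per_weight(name):.1f} bits/weight, "
              f"~{estimated_gib(7e9, name):.1f} GiB for 7B parameters")
```

Because of the per-block overhead, a "4-bit" type such as q4_0 costs about 4.5 bits per weight under these assumptions, which is worth keeping in mind when sizing a model against available RAM.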
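GPTQ emerged from academic research on post-training quantization. It achieves better accuracy-per-bit than naive round-to-nearest approaches by quantizing each layer's weights in sequence and using approximate second-order information, gathered from a small calibration set, to compensate for the error each rounding step introduces. Weights are usually quantized in groups (for example, 128 weights sharing one scale), which trades a little extra storage for accuracy. The auto-gptq library provides conversion tooling. Hardware acceleration requires specific CUDA kernels, primarily on NVIDIA GPUs with Tensor Cores.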
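Grouping is easier to picture with a small example. The sketch below quantizes a flat list of weights with settings such as 4 bits and a group size of 128, giving each group its own scale and offset (stored here as a floating-point minimum for simplicity). It uses plain round-to-nearest and deliberately omits GPTQ's error compensation, so it illustrates only the storage scheme. All function names are illustrative and not part of auto-gptq.

```python
import random


def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group.

    Returns (codes, scale, w_min) with integer codes in [0, 2**bits - 1].
    """
    levels = 2**bits - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels or 1.0  # constant group: avoid a zero scale
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def quantize_groupwise(weights, bits=4, group_size=128):
    """Quantize consecutive groups independently; one (scale, min) per group."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize_groupwise(groups):
    """Reconstruct approximate float weights from the per-group codes."""
    restored = []
    for codes, scale, w_min in groups:
        restored.extend(code * scale + w_min for code in codes)
    return restored


if __name__ == "__main__":
    random.seed(0)
    weights = [random.gauss(0.0, 0.02) for _ in range(512)]
    groups = quantize_groupwise(weights, bits=4, group_size=128)
    restored = dequantize_groupwise(groups)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{len(groups)} groups, worst-case reconstruction error {worst:.5f}")
```

Smaller groups track local weight ranges more closely, which lowers the rounding error but stores more scales and offsets per weight.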
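# Install auto-gptq
pip install auto-gptq
# Quantize model
python -c "
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
model_id = 'meta-llama/Llama-2-7b-hf'
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
# from_pretrained takes the quantize config as its second argument
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
# Calibration dataset for representative samples (one toy sentence here; use real text in practice)
examples = [tokenizer('Quantization reduces the precision of model weights.')]
model.quantize(examples)
model.save_quantized('llama-2-7b-gptq-4bit')
"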
AWQ (Activation-aware Weight Quantization) takes a different approach. Rather than treating all weights equally, it uses activation statistics from a small calibration set to find the small fraction of weight channels that matter most to the layer's output, then rescales those channels before quantization so they lose less precision. This method preserves more model capability at aggressive bit depths.
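The idea can be illustrated with a toy layer. The sketch below is a simplified stand-in, not the AWQ reference implementation: it derives per-channel scales from mean activation magnitude with a fixed exponent `alpha` (an assumption made here; the published method searches for the scaling), quantizes the scaled weights with symmetric round-to-nearest, and compares output error against unscaled quantization on synthetic data. NumPy is assumed to be installed, and all names are illustrative.

```python
import numpy as np


def quantize_rtn(w, n_bits=4):
    """Symmetric per-row round-to-nearest quantization, returned dequantized."""
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(w).max(axis=1, keepdims=True) / qmax
    step = np.maximum(step, 1e-12)  # guard against all-zero rows
    return np.clip(np.round(w / step), -qmax, qmax) * step


def awq_style_quantize(w, act_magnitude, alpha=0.5, n_bits=4):
    """Scale input channels by activation magnitude, quantize, undo the scale.

    A real deployment would fold the inverse scale into the preceding
    operation rather than dividing the weights afterwards.
    """
    s = act_magnitude ** alpha
    s = s / s.min()  # the least important channel keeps scale 1
    return quantize_rtn(w * s, n_bits) / s


rng = np.random.default_rng(0)
w = rng.normal(size=(512, 64))            # (out_features, in_features)
channel_std = np.full(64, 0.1)
channel_std[3] = 10.0                     # one salient input channel
x = rng.normal(size=(256, 64)) * channel_std   # synthetic calibration activations
act_magnitude = np.abs(x).mean(axis=0)    # per-channel activation statistic

reference = x @ w.T
err_plain = np.mean((x @ quantize_rtn(w).T - reference) ** 2)
err_awq = np.mean((x @ awq_style_quantize(w, act_magnitude).T - reference) ** 2)
print(f"plain round-to-nearest output MSE: {err_plain:.4f}")
print(f"activation-aware output MSE:       {err_awq:.4f}")
```

On this synthetic setup, where a single channel dominates the activations, protecting that channel lowers the output error. Real layers are less clear-cut, which is why the scaling is tuned against calibration data.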
Format comparison:
| Format | Best Use Case | Hardware | Memory Savings | Accuracy |
|---|---|---|---|---|
| GGUF | CPU-first, mixed hardware | CPU, Apple Silicon, NVIDIA | 4-8x | Good |
| GPTQ | GPU-optimized serving | NVIDIA Tensor Core | 4x | Very Good |
| AWQ | Edge deployment | NVIDIA (any) | 4-8x | Excellent |
Hardware constraints often make this choice for you. Of the three formats, GGUF is the only viable option for Apple Silicon or CPU-only deployments. GPTQ and AWQ require NVIDIA GPUs with enough VRAM to hold the quantized model before their kernels pay off.
Quantize the same 7B model in GGUF (q4_K_M) and GPTQ (W4A16). Benchmark perplexity on a standard evaluation set. Compare inference speed using identical prompts.
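One way to keep this comparison fair is to run both formats through the same two helpers. The sketch below computes perplexity from per-token natural-log probabilities and measures throughput over a fixed prompt list. How you obtain log-probabilities and token counts depends on the runtime, so the `generate` callable is a hypothetical adapter you would write for each backend.

```python
import math
import time


def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability); inputs are natural logs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def tokens_per_second(generate, prompts):
    """Time a generation callable over a fixed list of prompts.

    `generate` is a hypothetical adapter: it takes a prompt string and
    returns the number of tokens the backend produced for it.
    """
    start = time.perf_counter()
    total_tokens = sum(generate(prompt) for prompt in prompts)
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return total_tokens / elapsed


if __name__ == "__main__":
    # Stand-in values so the helpers can be exercised without a model.
    fake_logprobs = [math.log(0.25)] * 8
    print(f"perplexity: {perplexity(fake_logprobs):.2f}")
    print(f"throughput: {tokens_per_second(lambda p: len(p.split()), ['a b c'] * 100):.0f} tok/s")
```

Use the same evaluation text, prompts, context length, and generation settings for both runs, and discard a warm-up run so model loading does not skew the timing.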