AWQ Quantization — Model Optimization for Local Inference (Chapter 4)

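AWQ (Activation-aware Weight Quantization) builds on the observation that weights are not equally important: protecting roughly 1% of salient weight channels, identified by the magnitude of the activations they see rather than by the weights themselves, removes most of the error introduced by low-bit quantization. Protecting these channels while aggressively quantizing the remaining 99% yields better results than uniform quantization across all weights.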

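The algorithm identifies salient channels by running calibration samples through the model and recording activation magnitudes per input channel. Keeping those channels at fp16 while the rest quantize to INT4 recovers most of the accuracy, and the protected share is typically 0.1-1% of total weights, but mixed-precision storage is awkward for GPU kernels. AWQ therefore stores every weight in INT4 and instead multiplies the salient channels by a per-channel scale before quantization, applying the inverse scale on the activation side so the layer output is mathematically unchanged.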
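One way to protect a salient channel without storing it in a separate format is to scale its weight up before quantization and scale the matching activation down by the same factor, which is the approach the AWQ paper describes. The toy sketch below illustrates the effect on a single four-weight group. The numbers are hand-picked, the scale factor `s = 2.0` is an assumption (the real method searches for per-channel scales), and `quantize_group` is a hypothetical helper, not AutoAWQ code.

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) round-to-nearest quantization of one weight group.

    Returns the dequantized values so the rounding error can be inspected.
    """
    levels = 2 ** w_bit - 1                     # 15 integer steps for INT4
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels            # step size shared by the group
    zero = round(-w_min / scale)                # integer zero point
    q = [min(max(round(w / scale) + zero, 0), levels) for w in weights]
    return [(qi - zero) * scale for qi in q]


def output(ws, xs):
    """Dot product of one weight row with one activation vector."""
    return sum(w * x for w, x in zip(ws, xs))


# Hand-picked toy group. Index 3 is assumed to be the salient channel,
# i.e. the one whose input activation is much larger than the others.
weights = [0.1, -0.5, 1.0, 0.33]
activations = [0.2, 0.1, 0.3, 8.0]
salient, s = 3, 2.0

exact = output(weights, activations)

# Plain round-to-nearest INT4.
plain = output(quantize_group(weights), activations)

# Activation-aware variant: scale the salient weight up before quantizing
# and divide its activation by the same factor, so w * x is unchanged.
scaled_w = list(weights)
scaled_w[salient] *= s
scaled_x = list(activations)
scaled_x[salient] /= s
aware = output(quantize_group(scaled_w), scaled_x)

err_plain = abs(plain - exact)
err_aware = abs(aware - exact)
print(f"plain error: {err_plain:.3f}, activation-aware error: {err_aware:.3f}")
```

In this example the scaled weight stays below the group maximum, so the quantization step size does not change and the rounding error on the salient channel is divided by `s`. A single weight can still round unfavorably; the benefit holds on average, not for every value.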

import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

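# Quantization config
quant_config = {
    "zero_point": True,     # asymmetric quantization with a per-group zero point
    "q_group_size": 128,    # weights sharing one scale and zero point
    "w_bit": 4,             # INT4 weights
    "version": "GEMM"       # GEMM for batched inference, GEMV for single-sequence
}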

# Calibration samples - use domain-appropriate data
calibration_data = [
    # Code samples for code models, prose for language models
]

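# Quantize using the config and calibration samples defined above
# (calibration_data must be non-empty before this call)
model.quantize(tokenizer, quant_config=quant_config, calib_data=calibration_data)
model.save_quantized("llama-2-7b-awq")
tokenizer.save_pretrained("llama-2-7b-awq")  # keep tokenizer files next to the weights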
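
The group size and bit width in the config determine how much storage each weight costs. A rough estimate, assuming each group of 128 weights carries one fp16 scale and one 4-bit zero point (an assumption about the storage layout) and ignoring layers that stay unquantized, such as embeddings and norms:

```python
def bits_per_weight(w_bit=4, q_group_size=128, scale_bits=16, zero_bits=4):
    """Effective bits per weight, including per-group scale and zero point."""
    return w_bit + (scale_bits + zero_bits) / q_group_size


def quantized_size_gb(n_params, **kwargs):
    """Approximate size in GB (1e9 bytes) of the quantized weights alone."""
    return n_params * bits_per_weight(**kwargs) / 8 / 1e9


n_params = 7e9  # nominal parameter count for a "7B" model (assumption)

print(bits_per_weight())                      # 4.15625
print(round(quantized_size_gb(n_params), 2))  # 3.64
print(n_params * 16 / 8 / 1e9)                # 14.0 (fp16 baseline)
```

Smaller groups track the weight distribution more closely but raise the overhead: with `q_group_size=32` the same formula gives 4.625 bits per weight.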

AWQ supports two kernel versions: GEMM and GEMV. GEMM (general matrix-matrix multiply) optimizes batched inference and is the better choice for serving multiple requests. GEMV (general matrix-vector multiply) optimizes single-sequence inference and is the better choice for interactive use cases.

The version parameter matters more than commonly discussed. Switching from GEMM to GEMV can improve latency by 30-50% for single-user workloads, but GEMM wins when processing multiple requests in parallel.
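
The choice can be made explicit when the quantization config is built. The sketch below is a minimal illustration, assuming the only input is the expected number of concurrent sequences; `choose_awq_version` and `build_quant_config` are hypothetical helper names, not part of AutoAWQ.

```python
def choose_awq_version(max_batch_size: int) -> str:
    """Pick a kernel version from the expected number of concurrent sequences."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    # Single-sequence workloads favor GEMV; anything batched favors GEMM.
    return "GEMV" if max_batch_size == 1 else "GEMM"


def build_quant_config(max_batch_size: int) -> dict:
    """Build a quantization config with the same fields used earlier in this chapter."""
    return {
        "zero_point": True,
        "q_group_size": 128,
        "w_bit": 4,
        "version": choose_awq_version(max_batch_size),
    }


print(build_quant_config(1)["version"])   # GEMV
print(build_quant_config(16)["version"])  # GEMM
```

Because the version is fixed at quantization time, a deployment that serves both interactive and batched traffic may need two quantized copies of the model.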

Hardware considerations: AWQ kernels require CUDA-capable NVIDIA GPUs. AutoAWQ's documentation lists compute capability 7.5 as the minimum, which means Turing (RTX 20 series) and newer. Kepler (GTX 700 series), Maxwell (GTX 900 series), and Pascal (GTX 1000 series) lack support.

Common failure modes:

# Mismatched kernel version causes runtime error
model = AutoAWQForCausalLM.from_quantized(
    "model.awq",
    version="GEMV"  # If your hardware doesn't support GEMV
)
# RuntimeError: Kernel version not supported on this hardware

For safety, detect GPU compute capability before selecting kernel version:

# Check compute capability
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# 7.5 = Turing (RTX 20 series), 8.6 = Ampere (RTX 30 series), 8.9 = Ada Lovelace (RTX 40 series)
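
The same check can be scripted so that a loader refuses to start on an unsupported GPU. This is a sketch under stated assumptions: `MIN_CAPABILITY` is set to 7.5, the minimum AutoAWQ's documentation lists, and should be adjusted if your kernels differ. The helper names are hypothetical. It shells out to the `nvidia-smi` command shown above and returns an empty list when the tool is missing or the driver does not report the field.

```python
import subprocess

MIN_CAPABILITY = (7, 5)  # assumption: minimum compute capability for the AWQ kernels


def parse_compute_cap(text):
    """Parse one line of `nvidia-smi --query-gpu=compute_cap` output, e.g. '8.6'."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def query_compute_caps():
    """Return one (major, minor) tuple per visible GPU, or [] if the query fails."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [parse_compute_cap(line) for line in out.splitlines() if line.strip()]
    except (OSError, subprocess.CalledProcessError, ValueError):
        # nvidia-smi missing, query rejected by the driver, or unparsable output
        return []


def supports_awq(cap, minimum=MIN_CAPABILITY):
    """Tuples compare element-wise, so (8, 6) >= (7, 5) is True and (6, 1) is not."""
    return cap >= minimum


if __name__ == "__main__":
    caps = query_compute_caps()
    if not caps:
        print("No GPU capability reported; cannot verify AWQ kernel support.")
    for index, cap in enumerate(caps):
        status = "ok" if supports_awq(cap) else "unsupported"
        print(f"GPU {index}: compute capability {cap[0]}.{cap[1]} ({status})")
```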