AWQ
AWQ (Activation-aware Weight Quantization) is a 4-bit quantization method designed for fast inference on NVIDIA GPUs, and it is the production default for vLLM and SGLang serving. AWQ analyzes activation distributions during calibration to identify "salient" weight channels and protects them during quantization (via per-channel scaling) while quantizing the rest aggressively. The result is roughly 2% quality loss versus FP16 on most reasoning benchmarks for about 3.5× memory savings.
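To make the calibrate-then-quantize flow concrete, here is a minimal sketch using the AutoAWQ library mentioned below. The model path, output directory, and config values are illustrative assumptions rather than anything specified in this entry; the calls follow AutoAWQ's documented usage.

```python
# Minimal AWQ quantization sketch using AutoAWQ (pip install autoawq).
# Model path and config values are illustrative. AutoAWQ ships a default
# calibration dataset, so none needs to be passed explicitly.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"  # example model; swap in your own
quant_path = "llama-2-7b-awq"            # output directory for the quantized model

# 4-bit weights, group size 128, GEMM kernels: a common AWQ recipe.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibration pass: AutoAWQ runs sample activations through the model to find
# salient channels, applies per-channel scales, then quantizes weights to 4-bit.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The 4-bit, group-size-128 GEMM configuration shown here is the recipe most published AWQ checkpoints appear to use.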
Operator notes that matter: AWQ is NVIDIA-only (no AMD, no Apple). It requires a calibration dataset; the defaults that ship with the AutoAWQ library are usually fine. vLLM 0.7+ ships AWQ kernels with full PagedAttention compatibility, so throughput on A100/H100 is within 5% of FP16 at much lower VRAM cost. Compared to GPTQ: AWQ is generally faster at inference, while GPTQ offers more aggressive low-bit variants. Compared to GGUF Q4_K_M: AWQ is faster on serving runtimes; GGUF runs on more backends but lacks the kernel-level vLLM optimization.
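For the serving side, a minimal offline-inference sketch with vLLM's Python API follows. The checkpoint path is a placeholder (any AWQ checkpoint, such as one produced above, would work), and the quantization argument is vLLM's documented way to select its AWQ kernels.

```python
# Loading an AWQ checkpoint with vLLM (assumes a recent vllm install).
from vllm import LLM, SamplingParams

llm = LLM(
    model="llama-2-7b-awq",   # placeholder: local AWQ directory or an HF repo id
    quantization="awq",       # select vLLM's AWQ kernels
    dtype="half",             # activations stay FP16; only weights are 4-bit
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain activation-aware weight quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be exposed over HTTP with vLLM's OpenAI-compatible server, e.g. vllm serve <path> --quantization awq, where <path> is your AWQ checkpoint.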
When to use AWQ: production NVIDIA serving with vLLM/SGLang, where throughput per VRAM dollar matters. When NOT to use AWQ: AMD or Apple deployments (use GGUF Q4_K_M instead), or workloads where quantization quality matters above all else (use FP8 if you have H100s).