AWQ
AWQ (Activation-aware Weight Quantization) is a 4-bit quantization method designed for fast inference on NVIDIA GPUs, and it is the production default for vLLM and SGLang serving. AWQ analyzes activation distributions during calibration to identify "salient" weight channels and protects them during quantization (via per-channel scaling) while quantizing the rest aggressively. The result is roughly 2% quality loss versus FP16 on most reasoning benchmarks for about 3.5× memory savings.
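To make the calibrate-then-quantize flow concrete, here is a minimal sketch using the AutoAWQ library mentioned below. The model path, output directory, and config values are illustrative assumptions rather than anything specified in this entry; the calls follow AutoAWQ's documented usage.

```python
# Minimal AWQ quantization sketch using AutoAWQ (pip install autoawq).
# Model path and config values are illustrative. AutoAWQ ships a default
# calibration dataset, so none needs to be passed explicitly.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"  # example model; swap in your own
quant_path = "llama-2-7b-awq"            # output directory for the quantized model

# 4-bit weights, group size 128, GEMM kernels: a common AWQ recipe.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Calibration pass: AutoAWQ runs sample activations through the model to find
# salient channels, applies per-channel scales, then quantizes weights to 4-bit.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The 4-bit, group-size-128 GEMM configuration shown here is the recipe most published AWQ checkpoints appear to use.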
Operator notes that matter: AWQ is NVIDIA-only (no AMD, no Apple). It requires a calibration dataset; the defaults that ship with the AutoAWQ library are usually fine. vLLM 0.7+ ships AWQ kernels with full PagedAttention compatibility, so throughput on A100/H100 is within 5% of FP16 at much lower VRAM cost. Compared to GPTQ: AWQ is generally faster at inference, while GPTQ offers more aggressive low-bit variants. Compared to GGUF Q4_K_M: AWQ is faster on serving runtimes; GGUF runs on more backends but lacks the kernel-level vLLM optimization.
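For the serving side, a minimal offline-inference sketch with vLLM's Python API follows. The checkpoint path is a placeholder (any AWQ checkpoint, such as one produced above, would work), and the quantization argument is vLLM's documented way to select its AWQ kernels.

```python
# Loading an AWQ checkpoint with vLLM (assumes a recent vllm install).
from vllm import LLM, SamplingParams

llm = LLM(
    model="llama-2-7b-awq",   # placeholder: local AWQ directory or an HF repo id
    quantization="awq",       # select vLLM's AWQ kernels
    dtype="half",             # activations stay FP16; only weights are 4-bit
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain activation-aware weight quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be exposed over HTTP with vLLM's OpenAI-compatible server, e.g. vllm serve <path> --quantization awq, where <path> is your AWQ checkpoint.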
When to use AWQ: production NVIDIA serving with vLLM/SGLang, where throughput per VRAM dollar matters. When NOT to use AWQ: AMD or Apple deployments (use GGUF Q4_K_M instead), or workloads where quantization quality matters above all else (use FP8 if you have H100s).