RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Training & optimization

AWQ

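AWQ (Activation-aware Weight Quantization) is a 4-bit, weight-only quantization method designed for fast inference on NVIDIA GPUs. It is a common production default for vLLM and SGLang serving. During calibration, AWQ analyzes activation distributions to identify "salient" weight channels: the small fraction whose inputs have the largest magnitudes. It protects those channels by scaling them up before quantization and folding the inverse scale into the activation side, so they lose less precision while every weight is still stored in 4 bits. The rest is quantized aggressively. Result: ~2% quality loss vs FP16 on most reasoning benchmarks; ~3.5× memory savings.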
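The scaling idea can be illustrated with a toy example. The sketch below is not the AutoAWQ implementation: it uses a single output neuron, one calibration sample, symmetric round-to-nearest quantization with one shared step size, and hand-picked numbers (all assumptions chosen so the effect is visible). It searches an exponent `alpha` so that each channel scale is the activation magnitude raised to `alpha`, then compares the output error with plain rounding (`alpha = 0`).

```python
def quantize_dequantize(weights, n_bits=4):
    """Symmetric round-to-nearest quantization with one shared step size."""
    qmax = 2 ** (n_bits - 1) - 1              # 7 for 4-bit
    step = max(abs(w) for w in weights) / qmax
    return [round(w / step) * step for w in weights]


def output_error(weights, acts, scales, n_bits=4):
    """|y_quant - y_fp| for one neuron when channel j is scaled by scales[j]."""
    scaled_w = [w * s for w, s in zip(weights, scales)]   # scale weights up
    scaled_x = [x / s for x, s in zip(acts, scales)]      # inverse on activations
    w_q = quantize_dequantize(scaled_w, n_bits)
    y_fp = sum(w * x for w, x in zip(weights, acts))
    y_q = sum(w * x for w, x in zip(w_q, scaled_x))
    return abs(y_q - y_fp)


# Toy data (hypothetical): the last channel sees a much larger activation,
# so its weight is the "salient" one.
weights = [0.9, -0.5, 0.26, 0.19]
acts = [0.1, 0.2, 0.1, 10.0]

# Baseline: plain round-to-nearest, i.e. every scale equal to 1.
rtn_err = output_error(weights, acts, [1.0] * len(weights))

# Grid search over alpha in [0, 1]; alpha = 0 reproduces the baseline.
best_err, best_alpha = rtn_err, 0.0
for i in range(21):
    alpha = i / 20
    scales = [abs(x) ** alpha for x in acts]
    err = output_error(weights, acts, scales)
    if err < best_err:
        best_err, best_alpha = err, alpha

print(f"round-to-nearest error: {rtn_err:.4f}")
print(f"activation-aware error: {best_err:.4f} (alpha={best_alpha})")
```

On this toy input the scaled variant ends up with a much smaller output error than plain rounding, because the rounding error on the salient weight is divided by its scale. Real implementations work per layer, use group-wise step sizes, and average activation statistics over a calibration set.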

Operator notes that matter: AWQ is NVIDIA-only (no AMD, no Apple). It requires a calibration dataset; the default one used by the AutoAWQ library is usually fine. vLLM 0.7+ ships AWQ kernels with full PagedAttention compatibility, and throughput on A100/H100 is within 5% of FP16 at much lower VRAM cost. Compared to GPTQ: AWQ is generally faster at inference, while GPTQ offers more aggressive quant variants. Compared to GGUF Q4_K_M: AWQ is faster on serving runtimes, while GGUF works on more backends but lacks the kernel-level vLLM optimization.
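A rough sketch of the workflow those notes describe: quantize a checkpoint with AutoAWQ, then load it in vLLM. It assumes the `autoawq`, `transformers`, and `vllm` packages are installed on a CUDA machine. The function names `quantize_to_awq` and `serve_once` and any paths passed to them are illustrative, and the config values are commonly used settings rather than something this page prescribes.

```python
# Commonly used AutoAWQ settings: 4-bit weights, groups of 128, zero points.
QUANT_CONFIG = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}


def quantize_to_awq(model_path: str, out_dir: str) -> None:
    """Calibrate and quantize a Hugging Face checkpoint, then save it."""
    # Imported lazily so the file can be read without the GPU stack installed.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    # No calibration data is passed, so AutoAWQ uses its default dataset.
    model.quantize(tokenizer, quant_config=QUANT_CONFIG)
    model.save_quantized(out_dir)
    tokenizer.save_pretrained(out_dir)


def serve_once(awq_dir: str, prompt: str) -> str:
    """Load the quantized checkpoint in vLLM and generate one completion."""
    from vllm import LLM, SamplingParams

    llm = LLM(model=awq_dir, quantization="awq")
    params = SamplingParams(temperature=0.0, max_tokens=64)
    return llm.generate([prompt], params)[0].outputs[0].text
```

Calling `quantize_to_awq("path/to/fp16-model", "path/to/awq-model")` once and then `serve_once("path/to/awq-model", "Hello")` would exercise both steps; both paths are placeholders.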

When to use AWQ: production NVIDIA serving with vLLM/SGLang, where throughput-per-VRAM-dollar matters. When NOT to use AWQ: AMD or Apple deployments (use GGUF Q4_K_M instead), or workloads where you need the strongest possible quant quality at all costs (use FP8 if you have an H100).
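The rule of thumb above can be written down as a small helper. This is only an encoding of the guidance in this entry, not a benchmark-backed selector. The function name `suggest_quant`, its parameters, and the `UNDECIDED` fallback are hypothetical, and any case the text does not cover is returned as undecided rather than guessed.

```python
UNDECIDED = "no recommendation from these rules"


def suggest_quant(gpu_vendor: str, runtime: str,
                  needs_max_quality: bool = False,
                  has_h100: bool = False) -> str:
    """Map a deployment description to the quant format suggested above."""
    vendor = gpu_vendor.strip().lower()
    engine = runtime.strip().lower()

    # AMD and Apple deployments: use GGUF Q4_K_M instead of AWQ.
    if vendor in {"amd", "apple"}:
        return "GGUF Q4_K_M"
    # Quality at all costs: FP8 on an H100; other hardware is not covered.
    if needs_max_quality:
        return "FP8" if has_h100 else UNDECIDED
    # Production NVIDIA serving on vLLM or SGLang: AWQ.
    if vendor == "nvidia" and engine in {"vllm", "sglang"}:
        return "AWQ"
    return UNDECIDED


print(suggest_quant("NVIDIA", "vLLM"))
print(suggest_quant("Apple", "llama.cpp"))
```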

Related terms

  • Quantization
  • GPTQ
  • GGUF
  • FP8

See also

  • tool: vllm
  • tool: sglang
  • tool: tensorrt-llm
Buyer guides
  • Best GPU for local AI →
When it doesn't work
  • Quantization quality loss →
  • GGUF tokenizer mismatch →