RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Model Optimization for Local Inference

04. AWQ Quantization

Chapter 4 of 18 · 20 min
KEY INSIGHT

AWQ exploits the insight that most weights matter little—quantize aggressively where it costs nothing, preserve precision where it counts.

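AWQ (Activation-aware Weight Quantization) starts from the observation that weights are not equally important: protecting roughly 1% of salient weight channels greatly reduces quantization error. Salience is judged by the magnitude of the activations that flow through a channel, not by the magnitude of the weights themselves. Treating those channels specially while aggressively quantizing the remaining 99% yields better results than uniform round-to-nearest quantization across all weights.

The algorithm identifies salient channels by running calibration samples through the model and recording per-channel activation magnitudes. In the motivating experiment, keeping 0.1-1% of channels in FP16 recovers most of the lost accuracy, but mixed precision is awkward for inference kernels. AWQ therefore stores every weight in INT4 and protects the salient channels by scaling them up before quantization, with the inverse scale applied to the incoming activations. The per-channel scales are found by a small search over the calibration data.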
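
The scaling idea can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the AutoAWQ implementation: the helper names `fake_quant_int4` and `search_scales`, the synthetic data, and the 11-point grid are all made up for the example. It scales each input channel by its mean activation magnitude raised to a power `alpha`, quantizes the scaled weights, and keeps the `alpha` with the lowest output error. With `alpha = 0` the scales are all 1, which is plain round-to-nearest.

```python
import numpy as np

def fake_quant_int4(w, group_size=128):
    """Quantize to 4-bit integers per group, then dequantize (round-to-nearest)."""
    out_f, in_f = w.shape
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = g.min(axis=-1, keepdims=True)
    w_max = g.max(axis=-1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / 15.0    # 16 levels: 0..15
    zero = np.round(-w_min / scale)                   # per-group zero point
    q = np.clip(np.round(g / scale) + zero, 0, 15)
    return ((q - zero) * scale).reshape(out_f, in_f)

def search_scales(w, x, n_grid=11):
    """Grid-search alpha in [0, 1]; returns a list of (alpha, output MSE)."""
    act_mag = np.maximum(np.abs(x).mean(axis=0), 1e-8)  # per input channel
    ref = x @ w.T                                     # full-precision output
    results = []
    for alpha in np.linspace(0.0, 1.0, n_grid):
        s = act_mag ** alpha
        s = s / np.sqrt(s.max() * s.min())            # keep scales centered around 1
        out = (x / s) @ fake_quant_int4(w * s).T      # weights up, activations down
        results.append((float(alpha), float(np.mean((out - ref) ** 2))))
    return results

# Synthetic data (assumption): three input channels carry much larger activations.
rng = np.random.default_rng(0)
x = rng.normal(size=(64, 256))
x[:, :3] *= 30.0
w = rng.normal(size=(128, 256)) * 0.02

results = search_scales(w, x)
baseline_err = results[0][1]                          # alpha = 0, plain round-to-nearest
best_alpha, best_err = min(results, key=lambda r: r[1])
print(f"RTN error {baseline_err:.3e}, best alpha {best_alpha:.1f}, error {best_err:.3e}")
```

Because `alpha = 0` is part of the grid, the selected scales can never do worse than plain round-to-nearest on the calibration data. How much better they do depends on how concentrated the activation magnitudes are.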

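# Quantize a model with AutoAWQ
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
quant_path = "llama-2-7b-awq"
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the FP16 model
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Quantization config
quant_config = {
    "zero_point": True,    # asymmetric quantization with a per-group zero point
    "q_group_size": 128,   # weights sharing one scale and zero point
    "w_bit": 4,            # INT4 weights
    "version": "GEMM"      # kernel layout; GEMM suits batched inference
}

# Calibration samples - use domain-appropriate data
calibration_data = [
    # Code samples for code models, prose for language models
    # (fill in with text strings before running)
]

# Quantize; quant_config must be passed explicitly or it has no effect
model.quantize(tokenizer, quant_config=quant_config, calib_data=calibration_data)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)  # keep the tokenizer next to the weights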
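
A quick back-of-the-envelope calculation shows what a 4-bit, group-size-128 configuration costs in storage. This is a rough sketch under assumptions that are not from the chapter: one FP16 scale and one 4-bit zero point per group, and a round 7 billion quantized weights. Real checkpoints pack metadata differently and usually keep some layers (embeddings, for example) at higher precision, so treat the numbers as estimates.

```python
def bits_per_weight(w_bit=4, group_size=128, scale_bits=16, zero_bits=4):
    """Average storage per weight, including per-group scale and zero point."""
    return w_bit + (scale_bits + zero_bits) / group_size

def model_size_gb(n_params, bits):
    """Size in gigabytes (10^9 bytes) for n_params weights at the given bit width."""
    return n_params * bits / 8 / 1e9

n_params = 7e9                      # assumed round figure for a "7B" model
awq_bits = bits_per_weight()        # 4 + 20/128 bits per weight
fp16_gb = model_size_gb(n_params, 16)
awq_gb = model_size_gb(n_params, awq_bits)
print(f"{awq_bits:.3f} bits/weight: {fp16_gb:.1f} GB at FP16 vs {awq_gb:.2f} GB quantized")
```

Smaller groups improve accuracy but raise the metadata overhead: at a group size of 32, the same assumptions give 4.625 bits per weight.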

AWQ supports two kernel versions: GEMM and GEMV. GEMM (general matrix-matrix multiply) is optimized for batched inference and is the better choice when serving multiple requests. GEMV (general matrix-vector multiply) is optimized for single-sequence decoding and suits interactive use cases.

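The version parameter matters more than is commonly discussed. For single-user workloads, switching from GEMM to GEMV can cut latency noticeably (on the order of 30-50% in favorable cases, though the gain depends on the GPU, model size, and context length), while GEMM wins when multiple requests are processed in parallel. Measure both on your own hardware before committing to one.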
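
Since the gain is workload-dependent, it helps to measure it. Below is a minimal timing helper, offered as a sketch: `measure_latency_ms` is a hypothetical name, and the workload shown is a CPU stand-in rather than a real model call. When timing GPU inference, the callable should wait for the device to finish (for example by calling `torch.cuda.synchronize()` at the end), otherwise only the kernel launch is timed.

```python
import statistics
import time

def measure_latency_ms(fn, warmup=3, iters=10):
    """Median wall-clock latency of fn() in milliseconds, after warmup runs."""
    for _ in range(warmup):
        fn()                        # warm caches and lazy initialization
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

# Stand-in workload; replace with a single-prompt generate call per kernel version.
latency = measure_latency_ms(lambda: sum(range(10_000)))
print(f"median latency: {latency:.3f} ms")
```

The median is used rather than the mean so that a single slow iteration does not skew the comparison between kernel versions.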

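Hardware considerations: the AWQ CUDA kernels target NVIDIA GPUs, and the AutoAWQ README lists compute capability 7.5 (Turing) or newer as the requirement. Kepler (GTX 700 series), Maxwell (GTX 900 series), and Pascal (GTX 1000 series, compute capability 6.x) fall below that threshold. Turing and later devices (RTX 20 series and newer) are supported. Requirements change between releases, so check the documentation for the version you install.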

Common failure modes:

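# Mismatched kernel version causes runtime error
# Note: the kernel version is normally fixed at quantization time via the
# "version" key in quant_config, so verify how your AutoAWQ release
# handles it at load time.
model = AutoAWQForCausalLM.from_quantized(
    "model.awq",
    version="GEMV"  # If your hardware doesn't support GEMV
)
# RuntimeError: Kernel version not supported on this hardware
# (illustrative; the exact message depends on the installed release)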

For safety, detect GPU compute capability before selecting kernel version:

# Check compute capability
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
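# 7.5 = Turing (RTX 20 series, GTX 16 series), 8.6 = Ampere (RTX 30 series), 8.9 = Ada Lovelace (RTX 40 series)
# The compute_cap query field requires a reasonably recent NVIDIA driver.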
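
The same check can be done from Python by parsing the `nvidia-smi` output. This is a sketch under assumptions: the function names are hypothetical, and the `(7, 5)` threshold is the compute capability listed in the AutoAWQ README at the time of writing, which should be verified against the release you install. Choosing between kernel versions is left to the exercise.

```python
import subprocess

# Assumption: minimum compute capability for the AWQ CUDA kernels.
MIN_CAPABILITY = (7, 5)

def parse_compute_caps(output):
    """Turn nvidia-smi output such as '8.6\n7.5\n' into [(8, 6), (7, 5)]."""
    caps = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        major, _, minor = line.partition(".")
        caps.append((int(major), int(minor or 0)))
    return caps

def query_compute_caps():
    """Ask nvidia-smi for the compute capability of every visible GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_caps(out)

def meets_minimum(cap, minimum=MIN_CAPABILITY):
    """Tuples compare element-wise, so (8, 6) >= (7, 5) and (6, 1) < (7, 5)."""
    return cap >= minimum
```

On a machine with a GPU, `all(meets_minimum(c) for c in query_compute_caps())` reports whether every device clears the threshold. `query_compute_caps()` raises if `nvidia-smi` is missing or fails, so callers may want to catch `FileNotFoundError` and `subprocess.CalledProcessError`.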
EXERCISE

Implement a function that auto-detects GPU compute capability and selects the appropriate AWQ kernel version. Benchmark both versions with single-query latency.
