System guide · Quantization

Quantization formats for local AI — GGUF, AWQ, GPTQ, EXL2, FP8, MLX

The cross-runtime quant format reference. What GGUF / AWQ / GPTQ / EXL2 / FP8 / MLX-4bit actually do, when each wins, real quality-vs-VRAM tradeoffs, and how to pick the right one without trusting marketing.

By Fredoline Eruo · Last reviewed 2026-05-07

Why quantization exists

A 70B-parameter model in FP16 takes 140 GB of VRAM. No consumer GPU has 140 GB. Quantization is the bridge: it shrinks model weights from FP16 (16 bits per parameter) to 4-8 bits per parameter, trading a modest, well-characterized quality loss (typically 1-3% on reasoning benchmarks) for 3-4× memory savings. A Llama 3.3 70B in FP16 needs a multi-GPU datacenter node; in Q4_K_M its weights drop to roughly 42 GB and fit on a dual RTX 3090 NVLink rig or a single 48 GB workstation card, with headroom left for KV cache.

The real cost of quantization isn't the quality drop — that's well-characterized at 4-bit and above. The real cost is format fragmentation: every runtime favors a different quant family, and the wrong choice means rebuilding your stack. This guide is the cross-runtime reference for picking the right format for your hardware + runtime + budget.

[Figure: GPU VRAM partitioning during inference — two horizontal bars showing how 24 GB of VRAM is split between model weights, KV cache, activations, and runtime overhead. The first bar shows a 32B AWQ-INT4 model at 4K context (18 GB weights, ~24 GB total used); the second shows the same model at 32K context, where KV cache growth pushes the total to ~32 GB, past the 24 GB limit.]
VRAM is not just weights. KV cache scales with context length and batch size, so a model that fits at 4K can spill at 32K on the same card. Plan headroom against the workload, not the weight file.
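
A back-of-the-envelope sizing helper makes the caption's point concrete. The sketch below is an estimate under stated assumptions, not a profiler: the layer count, KV-head count, and head dimension are illustrative values you would replace with the target model's config, and real runtimes add allocator and framework overhead on top.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, batch=1, bytes_per_elem=2):
    """Approximate KV-cache size: keys + values stored for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * batch * bytes_per_elem / 1e9

def weights_gb(n_params, bits_per_param):
    """Approximate weight memory at a given average bit-width."""
    return n_params * bits_per_param / 8 / 1e9

# Illustrative 32B-class config (hypothetical values -- read the real ones from the model card).
weights = weights_gb(32e9, 4.25)             # AWQ-INT4 weights: ~17 GB
kv_4k   = kv_cache_gb(64, 8, 128, 4_096)     # FP16 KV cache at 4K context: ~1 GB
kv_32k  = kv_cache_gb(64, 8, 128, 32_768)    # same model at 32K context: ~8.6 GB
print(f"weights ~{weights:.1f} GB, KV@4K ~{kv_4k:.1f} GB, KV@32K ~{kv_32k:.1f} GB")
```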

How to read a quant format name

Quant names look cryptic. The decoder ring:

  • Bit-width prefix: Q4 = 4-bit, Q5 = 5-bit, Q8 = 8-bit. FP8 = 8-bit floating-point, INT4 = 4-bit integer.
  • Quant family: K_M / K_S = K-quants in llama.cpp (M=medium quality, S=small/aggressive). 0 / 1 = legacy GGML quants. AWQ, GPTQ, EXL2 = format-specific names.
  • Group size (where applicable): g32, g64, g128 — granularity at which scale/zero-point parameters are stored. Smaller group = higher quality, larger file.

Common mappings: Q4_K_M ≈ 4.83 bits/parameter average (mixed K-quant), Q5_K_M ≈ 5.69 bits/param, Q8_0 ≈ 8.5 bits/param, AWQ-INT4 ≈ 4.25 bits/param, FP8 = exactly 8 bits/param. The naive “params × nominal bits ÷ 8 = file size” formula under-predicts actual disk size by ~10-20%, because per-group scales, zero-points, and metadata push the effective bits/param above the nominal figure.
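
To turn those mappings into a capacity estimate, a helper along these lines is usually close enough. A sketch: the bits-per-parameter values mirror the mappings above, and the 70B figure is a worked example, not a measured file size.

```python
# Effective average bits per parameter (already includes per-group scales).
EFFECTIVE_BPW = {
    "Q4_K_M": 4.83, "Q5_K_M": 5.69, "Q8_0": 8.5,
    "AWQ-INT4": 4.25, "FP8": 8.0, "FP16": 16.0,
}

def weight_gb(n_params: float, fmt: str) -> float:
    """Weights-only estimate from effective bits/param (excludes KV cache and metadata)."""
    return n_params * EFFECTIVE_BPW[fmt] / 8 / 1e9

# Why the naive formula under-predicts: a "4-bit" K-quant really averages ~4.83 bits.
naive  = 70e9 * 4 / 8 / 1e9          # ~35 GB -- too optimistic
actual = weight_gb(70e9, "Q4_K_M")   # ~42 GB -- closer to the real file
print(f"naive {naive:.0f} GB vs effective {actual:.0f} GB")
```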

GGUF — the portable default

GGUF (GPT-Generated Unified Format) is the llama.cpp file format and the most-portable quant ecosystem. Single file contains the weights + tokenizer + chat template + metadata. K-quants (Q4_K_M, Q5_K_M, Q8_0) are the modern default; legacy quants (Q4_0, Q4_1) exist for backward compatibility but are operationally inferior.

Strengths: runs on every backend — CPU, NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan, Intel SYCL, ARM. The same file works across llama.cpp, Ollama, LM Studio, KoboldCPP, llamafile. K-quant kernel coverage in 2026 is excellent across all platforms.

Weaknesses: throughput on production-serving runtimes (vLLM, SGLang) trails AWQ by 15-30% because the K-quant kernel isn't as optimized for continuous batching as the AWQ-Marlin kernel pair. GGUF is operationally the right choice for portability + single-user inference, not for production multi-tenant serving.
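
For the single-user case, the GGUF path is short; here it is via the llama-cpp-python bindings as one example. A minimal sketch: the model path is a placeholder, and defaults for context size and offload vary by release and build flags.

```python
from llama_cpp import Llama  # pip install llama-cpp-python (build with CUDA/Metal as needed)

# Hypothetical local file -- any Q4_K_M / Q5_K_M GGUF works the same way.
llm = Llama(
    model_path="./models/llama-3.3-70b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload every layer that fits onto the GPU
    n_ctx=8192,        # context window; remember the KV cache grows with this
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF K-quants in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```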

AWQ — the production-NVIDIA serving leader

AWQ (Activation-aware Weight Quantization) identifies “salient” weight channels via activation analysis and protects them at higher precision while aggressively quantizing the rest. 4-bit only. NVIDIA-only.

Why AWQ wins for vLLM/SGLang: the AWQ → Marlin kernel is the most-optimized 4-bit serving kernel on H100/Ada/Ampere. With PagedAttention and continuous batching, AWQ-INT4 hits 95-98% of FP16 throughput on Hopper at 25% of the VRAM. That's why every production NVIDIA serving stack defaults to AWQ: it's the throughput-per-VRAM-dollar leader.

Operational caveats: AWQ requires a calibration dataset (default ones in AutoAWQ are usually fine). Some quant variants (e.g. early Llama 4 AWQ checkpoints) have tokenizer-alignment issues with tool calling — verify clean JSON output before committing. AWQ-INT3 exists but the quality drop is meaningful (~5-7% additional vs INT4); stay at INT4 unless VRAM is the absolute constraint.
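
A minimal vLLM smoke test covers both points above: load the AWQ checkpoint and confirm it emits parseable structured output before rolling it out. The model name and sampling values are placeholders, and recent vLLM builds typically auto-detect AWQ from the checkpoint config, so the explicit quantization flag is belt-and-braces.

```python
import json
from vllm import LLM, SamplingParams

# Hypothetical AWQ repo -- substitute the checkpoint you actually plan to serve.
llm = LLM(model="org/some-model-awq", quantization="awq", max_model_len=8192)

prompt = "Return only a JSON object with keys 'city' and 'country' for the Eiffel Tower."
out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=128))
text = out[0].outputs[0].text

json.loads(text)  # raises if the quant mangles structured output (or wraps it in prose)
print("JSON OK:", text)
```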

GPTQ — the predecessor

GPTQ (Generative Pre-trained Transformer Quantization) was the dominant 4-bit format from 2023-2024. Uses approximate second-order Hessian information to choose quant values. NVIDIA-only.

When GPTQ matters in 2026: the model only ships with GPTQ quants and AWQ isn't available; you're running ExLlamaV2 (which uses GPTQ-derived EXL2 quants); you need 3-bit quantization (more aggressive than AWQ's 4-bit floor). For new vLLM deployments, AWQ is generally the better choice — the GPTQ kernel optimization gets less attention from the vLLM team.

EXL2 — the consumer-NVIDIA single-stream leader

EXL2 is the ExLlamaV2 format. Allows fractional bit-rates (e.g. 4.65 bits/weight) by mixing higher-precision weights for important channels with lower-precision elsewhere. NVIDIA-only.

The EXL2 case operationally: fastest single-stream tok/s on consumer NVIDIA. Beats vLLM AWQ by 10-25% per-stream throughput at the same VRAM target. The dual-3090 NVLink + ExLlamaV2 combo is one of the highest single-stream-throughput setups under $2,000 in 2026.

Compatibility cost: EXL2 doesn't support continuous batching the way vLLM does, so multi-tenant concurrency is weaker. EXL2 doesn't run on AMD or Apple. For solo-user or small-team deployments where peak per-stream tok/s matters: EXL2. For multi-user production serving: AWQ via vLLM.
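
Since TabbyAPI exposes an OpenAI-compatible endpoint, the client side is runtime-agnostic. A sketch, assuming a local TabbyAPI instance with an EXL2 model already loaded; the base URL, API key, and model label are whatever your TabbyAPI config sets, not fixed defaults.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local TabbyAPI server (adjust host/port/key).
client = OpenAI(base_url="http://localhost:5000/v1", api_key="tabby-local-key")

resp = client.chat.completions.create(
    model="local-exl2-model",  # TabbyAPI serves whichever EXL2 model is loaded
    messages=[{"role": "user", "content": "One-line summary of EXL2's trade-off?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```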

FP8 — the H100 / Ada transformer-engine path

FP8 is 8-bit floating-point quantization that runs natively on the FP8 tensor cores of NVIDIA's Hopper and Ada transformer engines — the matmuls execute in FP8 rather than dequantizing to FP16 first. Two variants: E4M3 (4 exponent bits, 3 mantissa bits; finer precision, narrower range) is the usual choice for weights and activations in inference, while E5M2 (5 exponent, 2 mantissa; wider range) is mostly used for gradients during training.
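
The precision-vs-range trade-off between the two encodings is easiest to see numerically. A quick check, assuming a PyTorch build with float8 dtypes (2.1 or newer):

```python
import torch

# E4M3: more mantissa bits -> finer precision, smaller range (max ~448).
# E5M2: more exponent bits -> wider range (max ~57344), coarser precision.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, eps={info.eps}")
```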

When FP8 wins: H100 / H200 / B100 production deployments where you need quality close to FP16 (FP8 typically hits ~1% quality loss vs FP16, vs ~2% for AWQ-INT4) AND can pay the 2× VRAM cost vs INT4. RTX 4090 and 5090 also support FP8 on the Ada/Blackwell transformer engine but the throughput uplift is smaller than on Hopper.

FP8 doesn't help on AMD or Apple. Some FP8 quants require precise vLLM/TensorRT-LLM versions — verify the quant's manifest before committing.

MLX-4bit / MLX-8bit — the Apple Silicon path

MLX quants are Apple-Silicon-native formats tuned for the Apple unified memory architecture. Different file format (no GGUF compatibility). Used by MLX-LM and MLX Swift.

Operational reality on Apple Silicon: MLX-4bit is the fastest path on M-series Macs for most models. llama.cpp Metal works (and uses GGUF) but trails MLX by 10-30% on tok/s for the same model. The trade-off: MLX-format models don't cross-deploy to Linux/Windows NVIDIA — you maintain dual quant variants if you ship to both Apple and CUDA targets.
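
On an M-series Mac the MLX path looks roughly like the sketch below, via mlx-lm; the community repo name is a placeholder, and exact signatures shift between mlx-lm releases, so treat this as a shape rather than a contract.

```python
from mlx_lm import load, generate  # pip install mlx-lm (Apple Silicon only)

# Placeholder MLX 4-bit conversion -- swap in the model you actually use.
model, tokenizer = load("mlx-community/SomeModel-Instruct-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Why do MLX quants not run on CUDA?"}],
    add_generation_prompt=True, tokenize=False,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=128))
```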

Apple's Foundation Models APIs (iOS 26+) ship pre-quantized variants for Apple Intelligence. These are system-bundled and not directly runnable through MLX-LM — they're Apple's closed-quant pipeline.

Marlin — the high-throughput INT4 kernel

Marlin is technically a CUDA kernel, not a quant format — but it's the kernel that makes AWQ-INT4 fast on vLLM. It runs mixed-precision math (INT4 weights, FP16 activations) with hand-tuned CUDA that keeps the tensor cores saturated.

You don't pick “Marlin” as a format; you pick AWQ-INT4 + vLLM 0.6+, and Marlin runs under the hood. Knowing it exists matters because: (1) it's why AWQ beats GPTQ on vLLM throughput, (2) Marlin-compatible quants have specific calibration constraints — non-standard group-size AWQ quants may fall back to slower kernels.

Quality vs VRAM tradeoff matrix

Approximate quality loss vs FP16 across formats (editorial estimate, varies by model):

Format            Bits/param   Quality loss vs FP16   VRAM savings vs FP16
FP16              16           baseline               baseline
FP8 (E4M3)        8            ~1%                    ~2×
Q8_0 (GGUF)       8.5          ~1%                    ~1.9×
Q6_K (GGUF)       6.5          ~1.5%                  ~2.5×
Q5_K_M (GGUF)     5.7          ~2%                    ~2.8×
AWQ-INT4          4.25         ~2%                    ~3.7×
Q4_K_M (GGUF)     4.83         ~2.5%                  ~3.3×
EXL2 4.65bpw      4.65         ~2.5%                  ~3.4×
GPTQ-INT4         4.25         ~3-5%                  ~3.7×
Q3_K_M (GGUF)     3.9          ~5-8%                  ~4.1×
AWQ-INT3          3.25         ~7-10%                 ~5×
Q2_K (GGUF)       3.0          ~15%+                  ~5.3×

Operator rule of thumb: Q4_K_M / AWQ-INT4 is the knee of the curve — most VRAM savings for acceptable quality loss. Going below Q4 is rarely worth it on coding / reasoning workloads. Going above Q5 is only worth it when you have VRAM to burn and need every point of MMLU / GPQA quality.

Compatibility matrix by runtime

See the full runtime compatibility matrix for cross-axes. Format coverage:

Runtime                          GGUF   AWQ   GPTQ   EXL2   FP8   MLX
llama.cpp / Ollama / LM Studio   yes    –     –      –      –     –
vLLM                             yes    yes   yes    –      yes   –
SGLang                           yes    yes   yes    –      yes   –
ExLlamaV2 / TabbyAPI             –      –     yes    yes    –     –
TensorRT-LLM                     –      yes   yes    –      yes   –
MLX-LM / MLX Swift               –      –     –      –      –     yes
(yes = supported path; – = unsupported or impractical)

How to pick: decision tree

  1. What hardware? Apple Silicon → MLX-4bit. AMD GPU → GGUF Q4_K_M (ROCm path). Intel Arc / NPU → ONNX Runtime quants. NVIDIA → continue.
  2. Multi-user serving (vLLM/SGLang)? AWQ-INT4. The throughput-per-VRAM-dollar leader. FP8 only if H100 + 1% quality matters more than 2× VRAM cost.
  3. Solo user, max single-stream tok/s on consumer NVIDIA? EXL2 via ExLlamaV2 + TabbyAPI.
  4. Cross-platform portability matters? GGUF Q4_K_M. Runs on every backend.
  5. VRAM is brutally tight? → drop to Q3_K_M or AWQ-INT3 with a quality hit. Don't go lower without measuring quality on your specific workload.
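
The same decision tree, written out as a small helper for scripting defaults. A sketch: the labels mirror the steps above, and the fallback order for the NVIDIA branch is a judgment call, not a hard rule.

```python
def pick_quant(hardware: str, multi_user: bool = False,
               portability_first: bool = False, vram_tight: bool = False) -> str:
    """Mirror of the decision tree above; returns a default format, not a mandate."""
    if hardware == "apple":
        return "MLX-4bit"
    if hardware == "amd":
        return "GGUF Q4_K_M (ROCm)"
    if hardware == "intel":
        return "ONNX Runtime quants"
    # NVIDIA from here on.
    if multi_user:
        return "AWQ-INT4 (vLLM/SGLang)"
    if portability_first:
        return "GGUF Q4_K_M"
    if vram_tight:
        return "Q3_K_M or AWQ-INT3 (measure quality first)"
    return "EXL2 via ExLlamaV2 + TabbyAPI"

print(pick_quant("nvidia", multi_user=True))  # -> AWQ-INT4 (vLLM/SGLang)
```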

Quant-related failure modes

  1. Tokenizer alignment broken on new quants. Some early-cycle AWQ checkpoints emit malformed JSON for tool calls. Test with a tool-calling prompt before committing to a quant variant.
  2. FP8 quant pipeline mismatches. Some FP8 checkpoints require specific vLLM / TensorRT-LLM versions. Verify the quant's manifest.
  3. MTP head silently dropped during quantization. DeepSeek V4's multi-token-prediction head can be lost during AWQ conversion if the pipeline doesn't handle it. Test decode throughput post-quant — losing MTP forfeits roughly a 1.8× decode speedup.
  4. Quant + flash-attention version conflict. FA2 vs FA3 vs FlashInfer have subtly different kernel requirements; some quants compile only with specific FA versions. Pin the quant, the runtime, and the attention-backend versions together.
  5. K-quant kernel coverage gaps on AMD/Vulkan. Some K-quants (especially Q3 variants) lack optimized kernels on AMD ROCm or Vulkan. Stick to Q4_K_M / Q5_K_M / Q8_0 for non-CUDA backends until verified.

Going deeper