System guide · Quantization

Quantization formats for local AI — GGUF, AWQ, GPTQ, EXL2, FP8, MLX

The cross-runtime quant format reference. What GGUF / AWQ / GPTQ / EXL2 / FP8 / MLX-4bit actually do, when each wins, real quality-vs-VRAM tradeoffs, and how to pick the right one without trusting marketing.

By Fredoline Eruo · Last reviewed 2026-05-07

Why quantization exists

A 70B-parameter model in FP16 takes 140 GB of VRAM. No consumer GPU has 140 GB. Quantization is the bridge: it shrinks model weights from FP16 (16 bits per parameter) to 4-8 bits per parameter, trading a modest, well-characterized quality loss (typically 1-3% on reasoning benchmarks) for 3-4× memory savings. A Llama 3.3 70B in FP16 needs a multi-GPU datacenter node; in Q4_K_M its weights drop to roughly 42 GB and fit on a dual RTX 3090 NVLink rig or a single 48 GB workstation card, with headroom left for KV cache.

The real cost of quantization isn't the quality drop — that's well-characterized at 4-bit and above. The real cost is format fragmentation: every runtime favors a different quant family, and the wrong choice means rebuilding your stack. This guide is the cross-runtime reference for picking the right format for your hardware + runtime + budget.

[Figure: GPU VRAM partitioning during inference — two horizontal bars showing how 24 GB of VRAM is split between model weights, KV cache, activations, and runtime overhead. The first bar shows a 32B AWQ-INT4 model at 4K context (18 GB weights, ~24 GB total used); the second shows the same model at 32K context, where KV cache growth pushes the total to ~32 GB, past the 24 GB limit.]
VRAM is not just weights. KV cache scales with context length and batch size, so a model that fits at 4K can spill at 32K on the same card. Plan headroom against the workload, not the weight file.
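
A back-of-the-envelope sizing helper makes the caption's point concrete. The sketch below is an estimate under stated assumptions, not a profiler: the layer count, KV-head count, and head dimension are illustrative values you would replace with the target model's config, and real runtimes add allocator and framework overhead on top.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len, batch=1, bytes_per_elem=2):
    """Approximate KV-cache size: keys + values stored for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * batch * bytes_per_elem / 1e9

def weights_gb(n_params, bits_per_param):
    """Approximate weight memory at a given average bit-width."""
    return n_params * bits_per_param / 8 / 1e9

# Illustrative 32B-class config (hypothetical values -- read the real ones from the model card).
weights = weights_gb(32e9, 4.25)             # AWQ-INT4 weights: ~17 GB
kv_4k   = kv_cache_gb(64, 8, 128, 4_096)     # FP16 KV cache at 4K context: ~1 GB
kv_32k  = kv_cache_gb(64, 8, 128, 32_768)    # same model at 32K context: ~8.6 GB
print(f"weights ~{weights:.1f} GB, KV@4K ~{kv_4k:.1f} GB, KV@32K ~{kv_32k:.1f} GB")
```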

How to read a quant format name

Quant names look cryptic. The decoder ring:

  • Bit-width prefix: Q4 = 4-bit, Q5 = 5-bit, Q8 = 8-bit. FP8 = 8-bit floating-point, INT4 = 4-bit integer.
  • Quant family: K_M / K_S = K-quants in llama.cpp (M=medium quality, S=small/aggressive). 0 / 1 = legacy GGML quants. AWQ, GPTQ, EXL2 = format-specific names.
  • Group size (where applicable): g32, g64, g128 — granularity at which scale/zero-point parameters are stored. Smaller group = higher quality, larger file.

Common mappings: Q4_K_M ≈ 4.83 bits/parameter average (mixed K-quant), Q5_K_M ≈ 5.69 bits/param, Q8_0 ≈ 8.5 bits/param, AWQ-INT4 ≈ 4.25 bits/param, FP8 = exactly 8 bits/param. The naive “params × nominal bits ÷ 8 = file size” formula under-predicts actual disk size by ~10-20%, because per-group scales, zero-points, and metadata push the effective bits/param above the nominal figure.
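
To turn those mappings into a capacity estimate, a helper along these lines is usually close enough. A sketch: the bits-per-parameter values mirror the mappings above, and the 70B figure is a worked example, not a measured file size.

```python
# Effective average bits per parameter (already includes per-group scales).
EFFECTIVE_BPW = {
    "Q4_K_M": 4.83, "Q5_K_M": 5.69, "Q8_0": 8.5,
    "AWQ-INT4": 4.25, "FP8": 8.0, "FP16": 16.0,
}

def weight_gb(n_params: float, fmt: str) -> float:
    """Weights-only estimate from effective bits/param (excludes KV cache and metadata)."""
    return n_params * EFFECTIVE_BPW[fmt] / 8 / 1e9

# Why the naive formula under-predicts: a "4-bit" K-quant really averages ~4.83 bits.
naive  = 70e9 * 4 / 8 / 1e9          # ~35 GB -- too optimistic
actual = weight_gb(70e9, "Q4_K_M")   # ~42 GB -- closer to the real file
print(f"naive {naive:.0f} GB vs effective {actual:.0f} GB")
```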

GGUF — the portable default

GGUF (GPT-Generated Unified Format) is the llama.cpp file format and the most-portable quant ecosystem. Single file contains the weights + tokenizer + chat template + metadata. K-quants (Q4_K_M, Q5_K_M, Q8_0) are the modern default; legacy quants (Q4_0, Q4_1) exist for backward compatibility but are operationally inferior.

Strengths: runs on every backend — CPU, NVIDIA CUDA, AMD ROCm, Apple Metal, Vulkan, Intel SYCL, ARM. The same file works across llama.cpp, Ollama, LM Studio, KoboldCPP, llamafile. K-quant kernel coverage in 2026 is excellent across all platforms.

Weaknesses: throughput on production-serving runtimes (vLLM, SGLang) trails AWQ by 15-30% because the K-quant kernel isn't as optimized for continuous batching as the AWQ-Marlin kernel pair. GGUF is operationally the right choice for portability + single-user inference, not for production multi-tenant serving.
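
For the single-user case, the GGUF path is short; here it is via the llama-cpp-python bindings as one example. A minimal sketch: the model path is a placeholder, and defaults for context size and offload vary by release and build flags.

```python
from llama_cpp import Llama  # pip install llama-cpp-python (build with CUDA/Metal as needed)

# Hypothetical local file -- any Q4_K_M / Q5_K_M GGUF works the same way.
llm = Llama(
    model_path="./models/llama-3.3-70b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload every layer that fits onto the GPU
    n_ctx=8192,        # context window; remember the KV cache grows with this
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF K-quants in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```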

AWQ — the production-NVIDIA serving leader

AWQ (Activation-aware Weight Quantization) identifies “salient” weight channels via activation analysis and protects them at higher precision while aggressively quantizing the rest. 4-bit only. NVIDIA-only.

Why AWQ wins for vLLM/SGLang: the AWQ → Marlin kernel is the most-optimized 4-bit serving kernel on H100/Ada/Ampere. With PagedAttention and continuous batching, AWQ-INT4 hits 95-98% of FP16 throughput on Hopper at 25% of the VRAM. That's why every production NVIDIA serving stack defaults to AWQ: it's the throughput-per-VRAM-dollar leader.

Operational caveats: AWQ requires a calibration dataset (default ones in AutoAWQ are usually fine). Some quant variants (e.g. early Llama 4 AWQ checkpoints) have tokenizer-alignment issues with tool calling — verify clean JSON output before committing. AWQ-INT3 exists but the quality drop is meaningful (~5-7% additional vs INT4); stay at INT4 unless VRAM is the absolute constraint.
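
A minimal vLLM smoke test covers both points above: load the AWQ checkpoint and confirm it emits parseable structured output before rolling it out. The model name and sampling values are placeholders, and recent vLLM builds typically auto-detect AWQ from the checkpoint config, so the explicit quantization flag is belt-and-braces.

```python
import json
from vllm import LLM, SamplingParams

# Hypothetical AWQ repo -- substitute the checkpoint you actually plan to serve.
llm = LLM(model="org/some-model-awq", quantization="awq", max_model_len=8192)

prompt = "Return only a JSON object with keys 'city' and 'country' for the Eiffel Tower."
out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=128))
text = out[0].outputs[0].text

json.loads(text)  # raises if the quant mangles structured output (or wraps it in prose)
print("JSON OK:", text)
```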

GPTQ — the predecessor

GPTQ (Generative Pre-trained Transformer Quantization) was the dominant 4-bit format from 2023-2024. Uses approximate second-order Hessian information to choose quant values. NVIDIA-only.

When GPTQ matters in 2026: the model only ships with GPTQ quants and AWQ isn't available; you're running ExLlamaV2 (which uses GPTQ-derived EXL2 quants); you need 3-bit quantization (more aggressive than AWQ's 4-bit floor). For new vLLM deployments, AWQ is generally the better choice — the GPTQ kernel optimization gets less attention from the vLLM team.

EXL2 — the consumer-NVIDIA single-stream leader

EXL2 is the ExLlamaV2 format. Allows fractional bit-rates (e.g. 4.65 bits/weight) by mixing higher-precision weights for important channels with lower-precision elsewhere. NVIDIA-only.

The EXL2 case operationally: fastest single-stream tok/s on consumer NVIDIA. Beats vLLM AWQ by 10-25% per-stream throughput at the same VRAM target. The dual-3090 NVLink + ExLlamaV2 combo is one of the highest single-stream-throughput setups under $2,000 in 2026.

Compatibility cost: EXL2 doesn't support continuous batching the way vLLM does, so multi-tenant concurrency is weaker. EXL2 doesn't run on AMD or Apple. For solo-user or small-team deployments where peak per-stream tok/s matters: EXL2. For multi-user production serving: AWQ via vLLM.
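
Since TabbyAPI exposes an OpenAI-compatible endpoint, the client side is runtime-agnostic. A sketch, assuming a local TabbyAPI instance with an EXL2 model already loaded; the base URL, API key, and model label are whatever your TabbyAPI config sets, not fixed defaults.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local TabbyAPI server (adjust host/port/key).
client = OpenAI(base_url="http://localhost:5000/v1", api_key="tabby-local-key")

resp = client.chat.completions.create(
    model="local-exl2-model",  # TabbyAPI serves whichever EXL2 model is loaded
    messages=[{"role": "user", "content": "One-line summary of EXL2's trade-off?"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```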

FP8 — the H100 / Ada transformer-engine path

FP8 is 8-bit floating-point quantization that runs natively on the FP8 tensor cores of NVIDIA's Hopper and Ada transformer engines — the matmuls execute in FP8 rather than dequantizing to FP16 first. Two variants: E4M3 (4 exponent bits, 3 mantissa bits; finer precision, narrower range) is the usual choice for weights and activations in inference, while E5M2 (5 exponent, 2 mantissa; wider range) is mostly used for gradients during training.
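
The precision-vs-range trade-off between the two encodings is easiest to see numerically. A quick check, assuming a PyTorch build with float8 dtypes (2.1 or newer):

```python
import torch

# E4M3: more mantissa bits -> finer precision, smaller range (max ~448).
# E5M2: more exponent bits -> wider range (max ~57344), coarser precision.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    info = torch.finfo(dtype)
    print(f"{dtype}: max={info.max}, eps={info.eps}")
```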

When FP8 wins: H100 / H200 / B100 production deployments where you need quality close to FP16 (FP8 typically hits ~1% quality loss vs FP16, vs ~2% for AWQ-INT4) AND can pay the 2× VRAM cost vs INT4. RTX 4090 and 5090 also support FP8 on the Ada/Blackwell transformer engine but the throughput uplift is smaller than on Hopper.

FP8 doesn't help on AMD or Apple. Some FP8 quants require precise vLLM/TensorRT-LLM versions — verify the quant's manifest before committing.

MLX-4bit / MLX-8bit — the Apple Silicon path

MLX quants are Apple-Silicon-native formats tuned for the Apple unified memory architecture. Different file format (no GGUF compatibility). Used by MLX-LM and MLX Swift.

Operational reality on Apple Silicon: MLX-4bit is the fastest path on M-series Macs for most models. llama.cpp Metal works (and uses GGUF) but trails MLX by 10-30% on tok/s for the same model. The trade-off: MLX-format models don't cross-deploy to Linux/Windows NVIDIA — you maintain dual quant variants if you ship to both Apple and CUDA targets.
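
On an M-series Mac the MLX path looks roughly like the sketch below, via mlx-lm; the community repo name is a placeholder, and exact signatures shift between mlx-lm releases, so treat this as a shape rather than a contract.

```python
from mlx_lm import load, generate  # pip install mlx-lm (Apple Silicon only)

# Placeholder MLX 4-bit conversion -- swap in the model you actually use.
model, tokenizer = load("mlx-community/SomeModel-Instruct-4bit")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Why do MLX quants not run on CUDA?"}],
    add_generation_prompt=True, tokenize=False,
)
print(generate(model, tokenizer, prompt=prompt, max_tokens=128))
```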

Apple's Foundation Models APIs (iOS 26+) ship pre-quantized variants for Apple Intelligence. These are system-bundled and not directly runnable through MLX-LM — they're Apple's closed-quant pipeline.

Marlin — the high-throughput INT4 kernel

Marlin is technically a CUDA kernel, not a quant format — but it's the kernel that makes AWQ-INT4 fast on vLLM. It runs mixed-precision math (INT4 weights, FP16 activations) with hand-tuned CUDA that keeps the tensor cores saturated.

You don't pick “Marlin” as a format; you pick AWQ-INT4 + vLLM 0.6+, and Marlin runs under the hood. Knowing it exists matters because: (1) it's why AWQ beats GPTQ on vLLM throughput, (2) Marlin-compatible quants have specific calibration constraints — non-standard group-size AWQ quants may fall back to slower kernels.

Quality vs VRAM tradeoff matrix

Approximate quality loss vs FP16 across formats (editorial estimate, varies by model):

Format            Bits/param   Quality loss vs FP16   VRAM savings vs FP16
FP16              16           baseline               baseline
FP8 (E4M3)        8            ~1%                    ~2×
Q8_0 (GGUF)       8.5          ~1%                    ~1.9×
Q6_K (GGUF)       6.5          ~1.5%                  ~2.5×
Q5_K_M (GGUF)     5.7          ~2%                    ~2.8×
AWQ-INT4          4.25         ~2%                    ~3.7×
Q4_K_M (GGUF)     4.83         ~2.5%                  ~3.3×
EXL2 4.65bpw      4.65         ~2.5%                  ~3.4×
GPTQ-INT4         4.25         ~3-5%                  ~3.7×
Q3_K_M (GGUF)     3.9          ~5-8%                  ~4.1×
AWQ-INT3          3.25         ~7-10%                 ~5×
Q2_K (GGUF)       3.0          ~15%+                  ~5.3×

Operator rule of thumb: Q4_K_M / AWQ-INT4 is the knee of the curve — most VRAM savings for acceptable quality loss. Going below Q4 is rarely worth it on coding / reasoning workloads. Going above Q5 is only worth it when you have VRAM to burn and need every point of MMLU / GPQA quality.

Compatibility matrix by runtime

See the full runtime compatibility matrix for cross-axes. Format coverage:

Runtime                          GGUF   AWQ   GPTQ   EXL2   FP8   MLX
llama.cpp / Ollama / LM Studio   yes    –     –      –      –     –
vLLM                             yes    yes   yes    –      yes   –
SGLang                           yes    yes   yes    –      yes   –
ExLlamaV2 / TabbyAPI             –      –     yes    yes    –     –
TensorRT-LLM                     –      yes   yes    –      yes   –
MLX-LM / MLX Swift               –      –     –      –      –     yes
(yes = supported path; – = unsupported or impractical)

How to pick: decision tree

  1. What hardware? Apple Silicon → MLX-4bit. AMD GPU → GGUF Q4_K_M (ROCm path). Intel Arc / NPU → ONNX Runtime quants. NVIDIA → continue.
  2. Multi-user serving (vLLM/SGLang)? AWQ-INT4. The throughput-per-VRAM-dollar leader. FP8 only if H100 + 1% quality matters more than 2× VRAM cost.
  3. Solo user, max single-stream tok/s on consumer NVIDIA? EXL2 via ExLlamaV2 + TabbyAPI.
  4. Cross-platform portability matters? GGUF Q4_K_M. Runs on every backend.
  5. VRAM is brutally tight? → drop to Q3_K_M or AWQ-INT3 with a quality hit. Don't go lower without measuring quality on your specific workload.
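
The same decision tree, written out as a small helper for scripting defaults. A sketch: the labels mirror the steps above, and the fallback order for the NVIDIA branch is a judgment call, not a hard rule.

```python
def pick_quant(hardware: str, multi_user: bool = False,
               portability_first: bool = False, vram_tight: bool = False) -> str:
    """Mirror of the decision tree above; returns a default format, not a mandate."""
    if hardware == "apple":
        return "MLX-4bit"
    if hardware == "amd":
        return "GGUF Q4_K_M (ROCm)"
    if hardware == "intel":
        return "ONNX Runtime quants"
    # NVIDIA from here on.
    if multi_user:
        return "AWQ-INT4 (vLLM/SGLang)"
    if portability_first:
        return "GGUF Q4_K_M"
    if vram_tight:
        return "Q3_K_M or AWQ-INT3 (measure quality first)"
    return "EXL2 via ExLlamaV2 + TabbyAPI"

print(pick_quant("nvidia", multi_user=True))  # -> AWQ-INT4 (vLLM/SGLang)
```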

Quant-related failure modes

  1. Tokenizer alignment broken on new quants. Some early-cycle AWQ checkpoints emit malformed JSON for tool calls. Test with a tool-calling prompt before committing to a quant variant.
  2. FP8 quant pipeline mismatches. Some FP8 checkpoints require specific vLLM / TensorRT-LLM versions. Verify the quant's manifest.
  3. MTP head silently dropped during quantization. DeepSeek V4's multi-token-prediction head can be lost during AWQ conversion if the pipeline doesn't handle it. Test decode throughput post-quant — losing MTP forfeits roughly a 1.8× decode speedup.
  4. Quant + flash-attention version conflict. FA2 vs FA3 vs FlashInfer have subtly different kernel requirements; some quants compile only with specific FA versions. Pin the quant, the runtime, and the attention-backend versions together.
  5. K-quant kernel coverage gaps on AMD/Vulkan. Some K-quants (especially Q3 variants) lack optimized kernels on AMD ROCm or Vulkan. Stick to Q4_K_M / Q5_K_M / Q8_0 for non-CUDA backends until verified.

Going deeper