Quantization tradeoffs
Editorial

FP16 vs Q8 vs Q5 vs Q4 vs Q3 vs Q2

Quantization compresses model weights by reducing bits per parameter. Less memory, more speed, less quality. The tradeoff curve isn't linear — Q8 is nearly free; Q4 is the production sweet spot; Q3 starts breaking; Q2 is mostly a toy.
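The footprint numbers in the matrix below are just bits-per-parameter arithmetic. A minimal sketch, assuming rough effective bits-per-weight figures (group scales push effective bpw above the nominal bit width); weights only, no KV cache or activations:

```python
# Napkin math behind the footprint column: weights-only size is
# (parameter count) x (effective bits per weight) / 8.
# The bpw values are rough assumptions, not measured file sizes.
PARAMS = 70e9  # 70B-parameter model

BPW = {          # assumed effective bits per weight
    "FP16": 16.0,
    "Q8":    8.0,
    "Q5":    5.5,   # e.g. Q5_K_M
    "Q4":    4.6,   # e.g. Q4_K_M / AWQ-INT4 with group scales
    "Q3":    3.7,
    "Q2":    2.5,
}

for name, bpw in BPW.items():
    gb = PARAMS * bpw / 8 / 1e9          # bits -> bytes -> GB
    print(f"{name:>4}: ~{gb:.0f} GB")    # roughly 140 / 70 / 48 / 40 / 32 / 22
```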

Each dimension below is rated Excellent / Strong / Acceptable / Limited across six tiers: FP16/BF16 (the reference), Q8/INT8 (near-lossless), Q5/5-bit (quality-tilt), Q4/INT4 (default deploy), Q3/3-bit (aggressive), and Q2/2-bit (last resort).

Memory footprint: approx. weight size for a 70B-parameter model
- FP16/BF16: Limited. ≈140 GB. Multi-GPU only.
- Q8/INT8: Acceptable. ≈70 GB. Two-card 48 GB territory.
- Q5/5-bit: Strong. ≈48 GB. Fits a single 48 GB card, or can be split.
- Q4/INT4: Excellent. ≈40 GB. The 70B-on-one-pro-card sweet spot.
- Q3/3-bit: Excellent. ≈32 GB. Squeezes a 70B onto consumer hardware with a little headroom.
- Q2/2-bit: Excellent. ≈22 GB. Fits, but quality is uneven.

Output quality (instruction-following): operator-perceivable quality on chat / coding tasks
- FP16/BF16: Excellent. The reference; everything else is measured against it.
- Q8/INT8: Excellent. Within 1-2% of FP16 on most benchmarks; effectively lossless for chat.
- Q5/5-bit: Strong. Quality holds; some perceptible drift on complex coding.
- Q4/INT4: Strong. Production-acceptable for chat/RAG. K-quants (Q4_K_M) are noticeably better than the older Q4_0.
- Q3/3-bit: Acceptable. Visible degradation; usable for casual chat, not for code.
- Q2/2-bit: Limited. Hallucinations and reasoning collapse are common; experimental tier.

Decode speed (tok/s): lower-bit weights decode faster in the memory-bound regime
- FP16/BF16: Acceptable. Slowest tier; decode is memory-bandwidth bound, and FP16 moves the most weight bytes per token.
- Q8/INT8: Strong. ≈1.5-1.8x FP16 on consumer hardware.
- Q5/5-bit: Strong. ≈2-2.4x FP16; the K-quants are kernel-optimized.
- Q4/INT4: Excellent. ≈2.5-3x FP16; the dominant production tier.
- Q3/3-bit: Excellent. ≈3-3.5x FP16 where kernels exist; the gains are often outweighed by quality loss.
- Q2/2-bit: Excellent. Fastest, but that only matters if the quality is acceptable for the task.

Runtime support: which engines have stable kernels
- FP16/BF16: Excellent. Universal.
- Q8/INT8: Excellent. All major runtimes.
- Q5/5-bit: Strong. GGUF and ExLlamaV2; AWQ is 4-bit only.
- Q4/INT4: Excellent. GGUF Q4_K_M, AWQ-INT4, GPTQ, EXL2 at 4 bpw.
- Q3/3-bit: Acceptable. GGUF and EXL2; mainstream kernel coverage is incomplete.
- Q2/2-bit: Limited. GGUF only; experimental.

Multi-GPU friendliness: how well the quant splits across cards
- FP16/BF16: Strong. Tensor-parallel native.
- Q8/INT8: Strong. Tensor parallelism works in vLLM.
- Q5/5-bit: Acceptable. Layer-split is common; tensor parallelism is less universal.
- Q4/INT4: Strong. AWQ-INT4 and GPTQ work with vLLM tensor parallelism.
- Q3/3-bit: Limited. Layer-split only on most engines.
- Q2/2-bit: Limited. Layer-split only.

Operator recommendation: what we'd default to for a given goal
- FP16/BF16: Limited. Avoid unless your hardware has the VRAM and you need reference quality.
- Q8/INT8: Strong. Best for reproducing leaderboard numbers; preserves quality at roughly twice the footprint of Q4.
- Q5/5-bit: Strong. Quality-tilted production deploy; the extra quality matters when the output is read by humans.
- Q4/INT4: Excellent. The default for 90% of operators: production quality at sane VRAM.
- Q3/3-bit: Acceptable. For when you have to squeeze in; expect regressions on coding.
- Q2/2-bit: Limited. Toy / experimental. Don't ship.
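The speed column follows from the same arithmetic: single-stream decode has to stream the full weight set from VRAM for every token, so tokens per second is bounded by bandwidth over weight bytes. A napkin-math sketch, with an assumed ~1 TB/s of effective bandwidth; real speedups land below this bound because dequantization work, KV-cache traffic, and kernel overheads don't shrink with the weights:

```python
# Upper-bound decode speed in the memory-bound regime.
# Bandwidth and weight sizes are illustrative assumptions, not benchmarks.
BANDWIDTH_GB_S = 1000.0                       # assumed effective memory bandwidth
WEIGHT_GB = {"FP16": 140, "Q8": 70, "Q5": 48, "Q4": 40, "Q3": 32, "Q2": 22}

baseline = BANDWIDTH_GB_S / WEIGHT_GB["FP16"]
for name, gb in WEIGHT_GB.items():
    bound = BANDWIDTH_GB_S / gb               # batch size 1, weights-only traffic
    print(f"{name:>4}: <= {bound:5.1f} tok/s  ({bound / baseline:.1f}x the FP16 bound)")
```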

Operator defaults

If quality matters (coding, agents, knowledge work): start at Q5 or Q8. Drop to Q4 only after measuring against your task.

If you're VRAM-bound: Q4_K_M is the universal default. Don't go below Q4 without testing on your own workload.
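What that default looks like in practice: a minimal load-and-chat sketch, assuming the llama-cpp-python bindings and a locally downloaded Q4_K_M GGUF. The file path, context size, and prompt are placeholders, not recommendations.

```python
# Minimal sketch: load a Q4_K_M GGUF via llama-cpp-python and run one chat turn.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-70b-instruct.Q4_K_M.gguf",  # hypothetical local file
    n_gpu_layers=-1,   # offload all layers to the GPU(s) if they fit
    n_ctx=8192,        # context window; size the KV cache to your workload
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Refactor this function to be pure."}],
    max_tokens=256,
    temperature=0.2,
)
print(out["choices"][0]["message"]["content"])
```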

If you're shipping a product: measure quality on representative inputs at Q8 vs Q5 vs Q4 and pick the smallest quant your users won't notice. Don't take leaderboards at face value; perceived chat quality is workload-specific.
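A minimal sketch of that measurement, assuming each candidate quant is already served behind an OpenAI-compatible chat-completions endpoint (llama.cpp's server and vLLM both expose one). The ports, model names, and prompt file are placeholders; the point is the structure: same inputs through every tier, side-by-side outputs, blind review.

```python
# A/B the quant levels on your own traffic and dump side-by-side outputs
# for blind human review (or an automated judge).
import json
import requests

QUANT_ENDPOINTS = {                              # hypothetical: one server per quant
    "Q8_0":   "http://localhost:8001/v1/chat/completions",
    "Q5_K_M": "http://localhost:8002/v1/chat/completions",
    "Q4_K_M": "http://localhost:8003/v1/chat/completions",
}

def generate(endpoint: str, model: str, prompt: str) -> str:
    resp = requests.post(endpoint, json={
        "model": model,                          # some servers require it, others ignore it
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0,                        # keep runs comparable
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

prompts = [p.strip() for p in open("representative_prompts.txt") if p.strip()]

results = []
for prompt in prompts:
    row = {"prompt": prompt}
    for model, endpoint in QUANT_ENDPOINTS.items():
        row[model] = generate(endpoint, model, prompt)
    results.append(row)

with open("quant_ab_results.json", "w") as f:
    json.dump(results, f, indent=2)
```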