RUNLOCALAIv38
->Will it run?Best GPUCompareTroubleshootStartLearnPulseModelsHardwareToolsBench
Run check
RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
DIR
  • Models
  • Hardware
  • Tools
  • Benchmarks
TOOLS
  • Will it run?
  • Compare hardware
  • Cost vs cloud
  • Choose my GPU
  • Prompting kits
  • Quick answers
REF
  • All buyer guides
  • Learn local AI
  • Methodology
  • Glossary
  • Errors KB
  • Trust
EDITOR
  • About
  • Author
  • How we make money
  • Editorial policy
  • Contact
LEGAL
  • Privacy
  • Terms
  • Sitemap
MAIL · MONTHLY DIGEST
Get monthly local AI changes
Monthly recap. No spam.
DISCLOSURE

Some links on this site are affiliate links (Amazon Associates and other first-class retailers). When you buy through them, we earn a small commission at no extra cost to you. Affiliate links do not influence our verdicts — there are cards we rate highly that we don't have affiliate relationships with, and cards that sell well that we refuse to recommend. Read more →

© 2026 runlocalai.coIndependently operated
RUNLOCALAI · v38
  1. >
  2. Home
  3. /Compare
  4. /Quantization
Quantization tradeoffs
✓Editorial

FP16 vs Q8 vs Q5 vs Q4 vs Q3 vs Q2

Quantization compresses model weights by reducing bits per parameter. Less memory, more speed, less quality. The tradeoff curve isn't linear — Q8 is nearly free; Q4 is the production sweet spot; Q3 starts breaking; Q2 is mostly a toy.

Dimension
FP16/BF16
Reference
Q8/INT8
Near-lossless
Q5/5-bit
Quality-tilt
Q4/INT4
Default deploy
Q3/3-bit
Aggressive
Q2/2-bit
Last resort
Memory footprint
Approx. weight size for a 70B-parameter model.
Limited
≈140 GB. Multi-GPU only.
Acceptable
≈70 GB. Two-card 48 GB territory.
Strong
≈48 GB. Fits a single 48 GB card or split.
Excellent
≈40 GB. The 70B-on-one-pro-card sweet spot.
Excellent
≈32 GB. Squeezes 70B onto consumer + headroom.
Excellent
≈22 GB. Fits but quality is uneven.
Output quality (instruction-following)
Operator-perceivable quality on chat / coding tasks.
Excellent
Reference. Anything else is measured against this.
Excellent
Within 1-2% of FP16 on most benchmarks; effectively lossless for chat.
Strong
Quality holds. Some perceptible drift on complex coding.
Strong
Production-acceptable for chat/RAG. K-quants (Q4_K_M) noticeably better than older Q4_0.
Acceptable
Visible degradation; usable for casual chat, not for code.
Limited
Hallucinations + reasoning collapse common; experimental tier.
Decode speed (tok/s)
Lower-bit weights = faster decode (memory-bound regime).
Acceptable
Slowest among compute-rich GPUs; memory bandwidth bound.
Strong
≈1.5-1.8x FP16 on consumer hardware.
Strong
≈2-2.4x FP16; the K-quants are kernel-optimized.
Excellent
≈2.5-3x FP16; dominant production tier.
Excellent
≈3-3.5x FP16 if kernels exist; gains often outweighed by quality loss.
Excellent
Fastest but only matters if quality is acceptable for the task.
Runtime support
Which engines have stable kernels.
Excellent
Universal.
Excellent
All major runtimes.
Strong
GGUF + ExLlamaV2; AWQ has Q4 only.
Excellent
GGUF Q4_K_M, AWQ-INT4, GPTQ, EXL2-4bpw.
Acceptable
GGUF + EXL2; mainstream kernel coverage incomplete.
Limited
GGUF only; experimental.
Multi-GPU friendliness
How well the quant splits across cards.
Strong
Tensor-parallel native.
Strong
Tensor-parallel works in vLLM.
Acceptable
Layer-split common; tensor-parallel less universal.
Strong
AWQ-INT4 and GPTQ work with vLLM tensor-parallel.
Limited
Layer-split only on most engines.
Limited
Layer-split only.
Operator recommendation
What we'd default to for a given goal.
Limited
Avoid unless your hardware has the VRAM and you need reference quality.
Strong
Best for reproducing leaderboard numbers; preserve quality at 2x cost.
Strong
Quality-tilted production deploy; matters when output is read by humans.
Excellent
Default for 90% of operators. Production quality, sane VRAM.
Acceptable
When you must squeeze in; expect coding regression.
Limited
Toy / experimental. Don't ship.

Operator defaults

If quality matters (coding, agents, knowledge work): start at Q5 or Q8. Drop to Q4 only after measuring against your task.

If you're VRAM-bound: Q4_K_M is the universal default. Don't sub-Q4 without testing on YOUR workload.

If you're shipping a product: measure quality on representative inputs at Q8 vs Q4 vs Q5 and pick the smallest one your users won't notice. Don't take leaderboards at face value — chat-quality is workload-specific.

Next steps

Compare runtimes
OrCompare hardware tiersBrowse benchmarks