RUNLOCALAI · v38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Understanding AI Models

05. Q4_K_M vs Q8_0 vs Q2_K

Chapter 5 of 20 · 15 min
KEY INSIGHT

Q4_K_M hits the sweet spot for most users: significant VRAM savings with only about 5% quality loss compared to FP16.

The naming schemes for quantized models can be opaque. This chapter decodes the most common formats and their tradeoffs.

Format breakdown:

Q[bits]_[type], optionally followed by _[variant]
  • Q4_K_M: 4-bit K-quantization, medium (M) variant
  • Q8_0: 8-bit, not a K-quant (type 0: one scale per block, no offset), near-FP16 quality
  • Q2_K: 2-bit K-quantization, smallest files and lowest quality
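
To make the "_0" style concrete, here is a minimal Python sketch of Q8_0-style block quantization. It assumes the layout llama.cpp uses for this format (blocks of 32 weights sharing a single scale, no offset), which this chapter does not spell out; the function names are hypothetical, and a real implementation stores the scale in FP16 and works on packed tensors.

```python
BLOCK_SIZE = 32  # assumed block length, matching llama.cpp's Q8_0 layout


def quantize_q8_0_block(block):
    """Map one block of floats to (scale, integer codes in [-127, 127])."""
    amax = max(abs(x) for x in block)
    scale = amax / 127.0 if amax > 0 else 1.0  # one shared scale per block
    codes = [round(x / scale) for x in block]
    return scale, codes


def dequantize_q8_0_block(scale, codes):
    """Recover approximate floats: weight = scale * code."""
    return [scale * q for q in codes]


if __name__ == "__main__":
    import random

    random.seed(0)
    weights = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
    scale, codes = quantize_q8_0_block(weights)
    restored = dequantize_q8_0_block(scale, codes)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"scale={scale:.6g}  worst-case error={worst:.3g}")
```

With 8 bits the rounding error per weight is at most half the block scale, which is one way to see why Q8_0 stays so close to FP16.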

K-quantization explained:

K-quantization uses different precision for different parameter groups: the most sensitive parameters get higher precision. The letter after K (S, M, L) names the size variant. S is small, M is medium, and L is large; larger variants keep more of the model at higher precision.

For example, Q4_K_M typically keeps:

  • 20% of weights in Q6_K (6-bit per group)
  • 80% of weights in Q4_K (4-bit per group)
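
The arithmetic behind that split can be sketched in a few lines of Python. The bits-per-weight figures below (4.5 for Q4_K, 6.5625 for Q6_K, both including per-block scale overhead) are assumptions taken from llama.cpp's block layouts rather than from this chapter, and the helper names are hypothetical.

```python
# Bits per weight including per-block scale overhead.
# Assumed values from llama.cpp's block layouts, not from this chapter.
BPW = {"Q4_K": 4.5, "Q6_K": 6.5625}


def effective_bpw(mix):
    """mix maps a quant type to the fraction of weights stored in it."""
    total = sum(mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(BPW[name] * frac for name, frac in mix.items())


def size_gb(n_params, bpw):
    """Weight storage in GB, taking 1 GB = 1e9 bytes."""
    return n_params * bpw / 8 / 1e9


q4_k_m = {"Q6_K": 0.20, "Q4_K": 0.80}  # the approximate split quoted above
bpw = effective_bpw(q4_k_m)
print(f"{bpw:.2f} bits/weight, {size_gb(7e9, bpw):.1f} GB for 7B parameters")
```

Under these assumptions the mix averages about 4.9 bits per weight, or roughly 4.3 GB for 7B parameters. That is in the same range as the Q4_K_M figure in the comparison that follows, bearing in mind that the 20/80 split is itself approximate.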

VRAM comparison for 7B model:

Format    VRAM (approx)    Relative quality
FP16      14 GB            Baseline (100%)
Q8_0      7 GB             ~99% of FP16
Q5_K_M    4.9 GB           ~97% of FP16
Q4_K_M    4.1 GB           ~95% of FP16
Q3_K_M    3.2 GB           ~90% of FP16
Q2_K      2.7 GB           ~85% of FP16
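
As a sanity check on these figures, the short Python sketch below works backwards from each size to the average number of bits stored per weight. It assumes exactly 7e9 parameters and 1 GB = 1e9 bytes, neither of which the chapter states; the sizes are copied from the table.

```python
N_PARAMS = 7e9  # assumed parameter count for a "7B" model

# Sizes in GB copied from the table above.
TABLE_GB = {
    "FP16": 14.0,
    "Q8_0": 7.0,
    "Q5_K_M": 4.9,
    "Q4_K_M": 4.1,
    "Q3_K_M": 3.2,
    "Q2_K": 2.7,
}


def implied_bpw(size_gb, n_params=N_PARAMS):
    """Average bits stored per weight for a given size (1 GB = 1e9 bytes)."""
    return size_gb * 1e9 * 8 / n_params


for name, gb in TABLE_GB.items():
    saving = 1 - gb / TABLE_GB["FP16"]
    print(f"{name:7s} {implied_bpw(gb):5.2f} bits/weight  {saving:4.0%} smaller than FP16")
```

Under these assumptions Q2_K works out to roughly 3 bits per weight rather than 2. The number in a format name is nominal: block scales and the higher-precision parts of a K-quant mix add overhead on top of it.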

When to use each:

Q8_0: When you have enough VRAM and want quality as close to FP16 as possible. Good for final production runs if memory allows.

Q4_K_M: The default recommendation. Offers the best quality-per-VRAM ratio. Works for most tasks: coding, reasoning, chat. If you are unsure, start here.

Q2_K: Only when VRAM is severely limited. Output quality degrades noticeably. Useful for quickly testing a large model you cannot otherwise run.
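
The guidance above can be sketched as a small selection helper. This is only an illustration of the decision logic for a 7B model: the sizes are the approximate figures from this chapter's comparison, and both `pick_format` and its `headroom_gb` parameter are hypothetical names, with the headroom standing in for whatever memory your runtime needs beyond the weights.

```python
# (format, approximate weight size in GB for a 7B model), best quality first,
# using the figures from this chapter's VRAM comparison.
FORMATS_7B = [
    ("FP16", 14.0),
    ("Q8_0", 7.0),
    ("Q5_K_M", 4.9),
    ("Q4_K_M", 4.1),
    ("Q3_K_M", 3.2),
    ("Q2_K", 2.7),
]


def pick_format(vram_gb, headroom_gb=1.0):
    """Return the highest-quality format whose weights fit, or None.

    headroom_gb is a hypothetical allowance for memory used beyond the
    weights themselves; tune it for your own setup.
    """
    budget = vram_gb - headroom_gb
    for name, size_gb in FORMATS_7B:
        if size_gb <= budget:
            return name
    return None


for vram in (4, 6, 8, 16):
    print(f"{vram} GB -> {pick_format(vram)}")
```

The helper picks purely on fit. In practice Q4_K_M remains a reasonable default even when a larger format would fit, since the memory saved can go toward a longer context or a bigger model.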

The imatrix factor:

Some quantizations use an "importance matrix" (imatrix) computed by running the model over representative text samples. It estimates which weights matter most for accuracy so that the quantizer can preserve them better. GGUF files in formats such as Q4_K_M are often built with an imatrix, computed from the original full-precision model and a calibration dataset.
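
The idea can be illustrated with a toy Python sketch: when choosing the scale for a block of weights, weight each rounding error by an importance value instead of treating all weights equally. This is not llama.cpp's actual algorithm. The 4-bit symmetric scheme, the grid search, the function names, and the importance values below are all assumptions made for illustration.

```python
def quantize(block, scale, qmax=7):
    """Round to integer codes in [-qmax, qmax], then map back to floats."""
    restored = []
    for x in block:
        q = max(-qmax, min(qmax, round(x / scale)))
        restored.append(q * scale)
    return restored


def weighted_error(block, restored, importance):
    """Squared error where each weight counts in proportion to its importance."""
    return sum(w * (a - b) ** 2 for a, b, w in zip(block, restored, importance))


def best_scale(block, importance, qmax=7, steps=64):
    """Grid-search the scale that minimises importance-weighted error."""
    amax = max(abs(x) for x in block)
    if amax == 0:
        return 1.0
    base = amax / qmax  # the plain "fit the largest weight" scale
    candidates = [base * (0.5 + i / steps) for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda s: weighted_error(block, quantize(block, s, qmax), importance),
    )


block = [0.9, -0.05, 0.31, 0.12, -0.44, 0.02, 0.27, -0.63]
uniform = [1.0] * len(block)
# Hypothetical importance values: pretend the small weights matter most.
importance = [0.1, 5.0, 1.0, 5.0, 1.0, 5.0, 1.0, 0.1]

for label, imp in (("uniform", uniform), ("imatrix-style", importance)):
    s = best_scale(block, imp)
    err = weighted_error(block, quantize(block, s), importance)
    print(f"{label:14s} scale={s:.4f}  importance-weighted error={err:.5f}")
```

Because the plain scale is one of the candidates, the importance-aware choice can never do worse on the weighted objective. Whether it helps on real text depends on how representative the calibration samples are.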

# Example GGUF file naming
llama-3-8b-instruct-q4_k_m.gguf
# This file uses 4-bit K-quantization, medium (M) variant
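
Reading the tag out of a file name can be automated. The following Python sketch assumes the tag sits at the very end of the name, just before `.gguf`, and covers only the Q[bits]_[type] family described in this chapter; real-world file names vary, so the pattern and the `parse_quant` helper are hypothetical rather than any official convention.

```python
import re

# Matches tags such as q4_k_m, Q8_0 or q2_k at the end of a .gguf file name.
QUANT_RE = re.compile(
    r"[-_.]q(?P<bits>\d)_(?P<kind>k|\d)(?:_(?P<variant>[sml]))?\.gguf$",
    re.IGNORECASE,
)


def parse_quant(filename):
    """Return (bits, is_k_quant, variant) or None if no tag is found."""
    m = QUANT_RE.search(filename)
    if m is None:
        return None
    variant = m.group("variant")
    return (
        int(m.group("bits")),
        m.group("kind").lower() == "k",
        variant.upper() if variant else None,
    )


print(parse_quant("llama-3-8b-instruct-q4_k_m.gguf"))
```

For the example file name above this prints `(4, True, 'M')`: 4 bits, a K-quant, medium variant.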
EXERCISE

Download a small 3B model and quantize it to both Q8_0 and Q4_K_M using llama.cpp. Compare the file sizes, then run a simple benchmark on a known task to measure the quality difference.
