Q4_K_M Quantization

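Q4_K_M is the most-downloaded GGUF quantization on Hugging Face and the default tradeoff for local inference. It is a mixed-precision scheme from llama.cpp's k-quant family: the most sensitive tensors (in llama.cpp's recipe, roughly half of the attention value and FFN down-projection tensors) are stored at 6 bits (Q6_K), and everything else at 4 bits (Q4_K). An importance matrix computed from calibration text can optionally guide the rounding, but it is a separate step and not part of the format itself.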
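To see how mixing 6-bit and 4-bit tensors pushes the average above 4 bits, here is a minimal sketch. It assumes the llama.cpp k-quant block layouts, where Q4_K packs 256 weights into 144 bytes and Q6_K packs 256 weights into 210 bytes; the function names are illustrative and not part of any tool, and real files also contain a few unquantized tensors that this model ignores.

```python
# Bits per weight implied by the k-quant block layouts (assumed from
# llama.cpp: 256 weights per super-block, scales included in the byte count).
Q4_K_BPW = 144 * 8 / 256   # 4.5 bits per weight
Q6_K_BPW = 210 * 8 / 256   # 6.5625 bits per weight


def effective_bpw(frac_6bit: float) -> float:
    """Average bits per weight when `frac_6bit` of all weights use Q6_K."""
    if not 0.0 <= frac_6bit <= 1.0:
        raise ValueError("frac_6bit must be between 0 and 1")
    return frac_6bit * Q6_K_BPW + (1.0 - frac_6bit) * Q4_K_BPW


def frac_6bit_for(target_bpw: float) -> float:
    """Share of Q6_K weights needed to reach `target_bpw` on average."""
    return (target_bpw - Q4_K_BPW) / (Q6_K_BPW - Q4_K_BPW)


if __name__ == "__main__":
    print(f"all Q4_K: {effective_bpw(0.0):.2f} bpw")
    print(f"Q6_K share for 4.83 bpw: {frac_6bit_for(4.83):.0%}")
```

Under these assumptions, even a pure Q4_K file costs 4.5 bits per weight because of the stored scales, and an average near 4.83 bits corresponds to roughly one weight in six being held at 6 bits.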

Per-parameter cost averages ~4.83 bits, not 4, so the real file comes out ~20% larger than a naive 4-bit estimate. A 7B model is ~4.4 GB, a 13B is ~7.9 GB, and a 70B is ~42 GB. Perplexity typically rises by 0.1–0.2 points relative to FP16: invisible in chat, slightly visible on coding and math benchmarks.
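The sizing arithmetic can be sketched in a few lines. This is a rough estimate only: `gguf_size_gb` is a hypothetical helper, it uses decimal gigabytes, and it ignores file metadata and the fact that nominal parameter counts such as "13B" are rounded.

```python
def gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate file size in decimal GB, ignoring metadata overhead."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Nominal parameter counts; real models differ by a few percent.
    for label, n_params in [("13B", 13e9), ("70B", 70e9)]:
        naive = gguf_size_gb(n_params, 4.0)    # assumes exactly 4 bits
        actual = gguf_size_gb(n_params, 4.83)  # average quoted for Q4_K_M
        print(f"{label}: naive {naive:.1f} GB, at 4.83 bpw {actual:.1f} GB")
```

Because the result scales linearly with the parameter count, published file sizes will drift from these figures by however much the true count differs from the nominal one.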

Use Q4_K_M as the default. Step up to Q5_K_M only when there is VRAM headroom to spare; step down to Q3_K_M only when the model will not fit otherwise.

An operator wants to run Llama 3.1 8B on a 12 GB RTX 3060 with room left for the KV cache at an 8K context window. FP16 weights would need ~16 GB, a nonstarter. Q4_K_M brings the weights down to roughly 4.9 GB, leaving ~7 GB for context and OS overhead, comfortably inside the 12 GB budget. They pull the Q4_K_M .gguf file from a Hugging Face repo, load it in llama.cpp or Ollama, and compare a few prompts side by side with FP16 reference outputs. The answers are indistinguishable for chat and summarization, with only minor wobble on multi-step arithmetic. That is the expected tradeoff: for anything short of a benchmark run or an agentic coding pipeline, Q4_K_M is the format to reach for first. Only once it is confirmed to fit do they consider stepping up to Q5_K_M, VRAM permitting.
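The budget check in this scenario can be sketched as follows. The model shape used here (32 layers, 8 KV heads, head size 128) and the FP16 cache are assumptions about Llama 3.1 8B that do not come from the text above, and both helper functions are hypothetical. Runtime compute buffers and desktop usage are not modeled, so treat the headroom as an upper bound.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                n_ctx: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in decimal GB: keys plus values for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 1e9


def headroom_gb(vram_gb: float, weights_gb: float, kv_gb: float) -> float:
    """VRAM left after weights and KV cache (negative means it does not fit)."""
    return vram_gb - weights_gb - kv_gb


if __name__ == "__main__":
    # Assumed Llama 3.1 8B shape; 8192 tokens of context, FP16 cache entries.
    kv = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
    print(f"KV cache at 8K context: {kv:.2f} GB")
    print(f"headroom on a 12 GB card: {headroom_gb(12.0, 4.9, kv):.2f} GB")
```

With these assumptions the 8K cache costs about 1 GB, which is why the Q4_K_M weights leave a wide margin on a 12 GB card while FP16 weights alone would already exceed it.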

Reviewed by Eruo Fredoline. See our editorial policy.
