RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

© 2026 runlocalai.co · Independently operated
Editorial · Reviewed May 2026

Quantization quality loss — when the quant is the problem

A drop in output quality after quantization usually means the weights are quantized too aggressively (too few bits per weight, bpw), the KV cache is quantized too aggressively, or the calibration data didn't match your workload. Q4_K_M is the safe floor; anything below that needs care.

llama.cpp GGUF · ExLlamaV2 EXL2 · AWQ · GPTQ · any quantized inference
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

Quantization tier too aggressive (Q3, Q2, IQ2)

Diagnose

The model produces incoherent text, repetitive loops, or off-topic responses, and is noticeably worse than the FP16 baseline on the same prompts.

Fix

Move up a tier. Q3 → Q4_K_M is the cleanest jump, and IQ2 → Q4_K_M nearly always improves output. Q4_K_M is the modern standard floor; anything below it risks visible quality loss.
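
Moving up a tier costs memory, so it helps to estimate the weight footprint before downloading. The sketch below is a rough illustration only: the bpw figures in `APPROX_BPW` are assumed, rounded values (real GGUF files vary by model and by which tensors are kept at higher precision), and the function name is hypothetical.

```python
# Rough weight-size estimate: parameters * bits-per-weight / 8 bytes.
# The bpw values below are ASSUMED, rounded figures for illustration;
# actual GGUF files differ by model and tensor mix.
APPROX_BPW = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}


def estimate_weight_gib(n_params: float, bpw: float) -> float:
    """Return the approximate size of the weights in GiB."""
    return n_params * bpw / 8 / 2**30


def compare_tiers(n_params: float) -> dict[str, float]:
    """Map each tier in APPROX_BPW to its estimated weight size in GiB."""
    return {name: estimate_weight_gib(n_params, bpw) for name, bpw in APPROX_BPW.items()}
```

Calling `compare_tiers(70e9)` gives a quick view of how much extra memory a Q3 → Q4_K_M bump would need for a 70B model. The KV cache and runtime overhead come on top of this figure.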

#2

KV cache quantized too aggressively

Diagnose

Coherence drops at long context: output starts out fine, then degrades past roughly 4K tokens. A KV cache at Q4 hurts attention precision more than weight quantization does.

Fix

Use an FP16 or Q8 KV cache, not Q4. In llama.cpp, pass `--cache-type-k q8_0 --cache-type-v q8_0`. Q8 is the comfortable floor for the cache; don't go below it for long-context work.
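
To see what the cache type costs in memory, the arithmetic can be sketched as follows. This is a minimal estimate, not llama.cpp's own accounting: it assumes the cache stores one K and one V vector per layer per token, and that `q8_0` and `q4_0` use blocks of 32 values plus one FP16 scale (34 and 18 bytes per block). The function name and the example model shape are illustrative assumptions.

```python
# Bytes per cached element for each cache type.
# q8_0: 32 int8 values + one fp16 scale = 34 bytes per 32 elements.
# q4_0: 32 4-bit values + one fp16 scale = 18 bytes per 32 elements.
BYTES_PER_ELEM = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, cache_type: str = "f16") -> float:
    """Estimate KV cache size in GiB for a single sequence of n_ctx tokens."""
    # Factor 2 covers both the K and the V tensor in every layer.
    elems_per_token = 2 * n_layers * n_kv_heads * head_dim
    return elems_per_token * n_ctx * BYTES_PER_ELEM[cache_type] / 2**30
```

For an assumed 70B-class shape (80 layers, 8 KV heads, head dimension 128) at 32K context, `kv_cache_gib(80, 8, 128, 32768, "f16")` comes to 10 GiB and the `q8_0` variant to about 5.3 GiB. Dropping further to `q4_0` saves roughly another 2.5 GiB, which is the saving you trade against the long-context degradation described above.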

#3

Wrong quantization for the model architecture

Diagnose

Some models (Mixture-of-Experts, models with long shared embeddings) lose more quality at the same bpw than dense models. Output quality drops disproportionately.

Fix

Use higher bpw for MoE / non-standard architectures. Qwen 3 235B-A22B (MoE) needs Q5+ for stability. Dense Llama 70B is fine at Q4_K_M.

#4

Calibration data mismatch (AWQ / GPTQ specific)

Diagnose

AWQ / GPTQ quants calibrated on English text underperform on code, multilingual, or specialized domains.

Fix

Find a quant calibrated on relevant data. Or fall back to GGUF Q4_K_M (calibration-free). For code workloads, prefer code-calibrated quants.

#5

Comparing to a fine-tune you didn't actually quantize

Diagnose

You're running a quantized base model and expecting the behavior of a fine-tune. Fine-tuning changes the weights, so a quant made from the base model carries none of the fine-tune's behavior.

Fix

Check the model card. Many GGUF / EXL2 repos quantize the base, not the fine-tune. Find a quant of the specific fine-tune you want, or quantize it yourself.
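
When the model card is ambiguous, the metadata inside a GGUF file can help confirm what was actually quantized. The sketch below is a minimal standard-library reader, written under the assumption that the file follows the little-endian GGUF layout (version 2 or later) with metadata keys such as `general.name`. The function names are hypothetical, and uploaders do not always fill in these keys, so treat the result as a hint rather than proof.

```python
import struct

# GGUF metadata value types: type id -> struct format for fixed-size scalars.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _read(f, fmt):
    """Read one fixed-size value described by a struct format string."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF strings are a uint64 length followed by UTF-8 bytes."""
    length = _read(f, "<Q")
    return f.read(length).decode("utf-8", errors="replace")


def _read_value(f, vtype):
    """Read one metadata value; arrays recurse over their element type."""
    if vtype in _SCALARS:
        return _read(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _read(f, "<I")
        count = _read(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_general_metadata(path, prefix="general."):
    """Return metadata entries whose key starts with `prefix`."""
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        version = _read(f, "<I")
        if version < 2:
            raise ValueError("GGUF v1 uses a different header layout")
        _read(f, "<Q")              # tensor count, not needed here
        kv_count = _read(f, "<Q")   # number of metadata key/value pairs
        found = {}
        for _ in range(kv_count):
            key = _read_string(f)
            value = _read_value(f, _read(f, "<I"))
            if key.startswith(prefix):
                found[key] = value
        return found
```

Running `read_gguf_general_metadata("model.gguf")` on a downloaded file (the path is a placeholder) and checking `general.name` against the fine-tune you expected is a quick sanity check before blaming the quant level.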

Frequently asked questions

What's the safe minimum quantization for production?

Q4_K_M (GGUF) or 4.0 bpw (EXL2). Below this, quality degrades enough to be noticeable on adversarial prompts. Q5_K_M is the comfort zone for high-stakes work; Q8 is essentially lossless.

How do I measure quantization quality objectively?

Run perplexity on a held-out test set (`./llama-perplexity`) — lower is better. Compare your quant's PPL to the FP16 baseline; >1% increase is meaningful. For chat models, qualitative testing on diverse prompts is essential.
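
The comparison itself is simple arithmetic. As a minimal sketch, assuming you already have perplexity figures for the quant and the FP16 baseline measured on the same test set, the relative increase and the 1% threshold from the answer above can be checked like this (the function names are illustrative, not part of llama.cpp):

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity is exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ppl_increase_pct(ppl_quant: float, ppl_baseline: float) -> float:
    """Relative perplexity increase of the quant over the baseline, in percent."""
    return (ppl_quant - ppl_baseline) / ppl_baseline * 100.0


def is_meaningful_loss(ppl_quant: float, ppl_baseline: float,
                       threshold_pct: float = 1.0) -> bool:
    """True when the quant's perplexity exceeds the baseline by more than the threshold."""
    return ppl_increase_pct(ppl_quant, ppl_baseline) > threshold_pct
```

Both numbers have to come from the same text and the same context length, otherwise the percentage is not comparable.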

GGUF vs EXL2 vs AWQ — which has best quality at the same bpw?

Roughly equivalent at 4.0+ bpw. Below 4.0, EXL2's calibration tends to outperform GGUF Q3. AWQ uses activation-aware calibration that helps on specific architectures. For most users, GGUF Q4_K_M is the practical default.

Related troubleshooting

Model keeps crashing / segfault during inference

Mid-inference crashes (segfault, illegal memory access, kernel panic) usually mean VRAM ECC, thermal throttling, PSU instability, or a bad model file. Here's the diagnostic order.

GGUF tokenizer mismatch / 'tokenizer model not found'

When llama.cpp / Ollama outputs garbled text or repeats tokens infinitely, the tokenizer baked into the GGUF doesn't match the runtime's expectations. Here's how to confirm and fix.

ExLlamaV2: model not loading / 'Could not find model index' / cache OOM

ExLlamaV2 load failures trace to wrong model format (needs EXL2 or EXL3, not GGUF), insufficient cache for context, or a driver/runtime version mismatch. The exl2 format is non-negotiable.
