Quantization issues

Q2_K or Q3 quantized model produces nonsense

(no error — output is incoherent at Q2_K but fine at Q4_K_M)
By Fredoline Eruo · Last verified May 6, 2026

Cause

Q2_K is too aggressive for most models below ~30B parameters. The 2-bit quantization degrades the weights enough that the model becomes incoherent: output that looks fluent but says nothing meaningful, makes arithmetic errors, and contradicts itself.

For 7B-13B models, Q4_K_M is the practical floor. Q3_K_M is borderline. Q2_K is only usable on 70B+ models, where there is enough redundancy in the weights to absorb the quality loss.
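A rough rule of thumb shows why dropping below Q4_K_M buys little at small scale: file size is roughly parameters (in billions) times bits per weight, divided by 8. The sketch below uses approximate llama.cpp average bpw figures (the exact values vary slightly by model architecture):

```shell
# Rough GGUF footprint: params (billions) x bits-per-weight / 8 = GB.
# bpw values are approximate llama.cpp averages (assumption).
est_gb() { awk -v p="$1" -v bpw="$2" 'BEGIN {printf "%.1f", p * bpw / 8}'; }

echo "8B @ Q2_K   (~2.6 bpw): $(est_gb 8 2.6) GB"
echo "8B @ Q3_K_M (~3.9 bpw): $(est_gb 8 3.9) GB"
echo "8B @ Q4_K_M (~4.9 bpw): $(est_gb 8 4.9) GB"
```

For an 8B model, Q2_K saves only about 2.3 GB over Q4_K_M while destroying coherence, which is why Q4_K_M is the sensible floor at this size.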

Solution

Drop to Q4_K_M minimum for any model under 30B:

ollama pull llama3.1:8b-instruct-q4_K_M

For 70B-class models where you legitimately need Q2_K to fit on consumer hardware, expect a noticeable quality drop on:

  • Multi-step reasoning (math, planning)
  • Code generation correctness
  • Strict instruction following

Better alternative for tight VRAM: an MoE model. Qwen 3 30B-A3B at Q4_K_M (18 GB) outperforms Llama 70B at Q2_K (24 GB) on most tasks, because the MoE keeps quality-preserving 4-bit weights while still fitting in less memory than the 2-bit dense model.
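The 18 GB and 24 GB figures above follow from the same size rule of thumb (bpw values are approximate llama.cpp averages, so treat this as a sanity check rather than exact file sizes):

```shell
# Rough footprint check for the two options (bpw values approximate).
est_gb() { awk -v p="$1" -v bpw="$2" 'BEGIN {printf "%.0f", p * bpw / 8}'; }

echo "Qwen 3 30B-A3B @ Q4_K_M (~4.9 bpw): $(est_gb 30 4.9) GB"
echo "Llama 70B      @ Q2_K   (~2.7 bpw): $(est_gb 70 2.7) GB"
```

The MoE is both smaller and quantized less aggressively, which is the whole trade.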

Or use CPU offload instead of aggressive quantization:

# Llama 3.3 70B at Q4_K_M, 30 of 80 layers on the GPU (remaining 50 run on CPU)
./llama-cli -m llama-3.3-70b.Q4_K_M.gguf --n-gpu-layers 30

Generation is slower (~12 tok/s instead of 35) but the output stays coherent.
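To pick a --n-gpu-layers value for your own card, divide the model file size by its layer count to get a per-layer cost, then see how many layers fit after reserving headroom for the KV cache and compute buffers. All the numbers below are rough assumptions; adjust them to your model and GPU:

```shell
# Rough --n-gpu-layers sizing (all figures are approximate assumptions).
model_gb=42      # Llama 3.3 70B at Q4_K_M, approximate file size
layers=80        # transformer blocks in a 70B Llama
vram_gb=24       # e.g. a single 24 GB consumer GPU
reserve_gb=6     # rough headroom for KV cache + compute buffers

fit=$(awk -v m="$model_gb" -v l="$layers" -v v="$vram_gb" -v r="$reserve_gb" \
  'BEGIN {print int((v - r) / (m / l))}')
echo "--n-gpu-layers $fit"
```

If you hit CUDA out-of-memory at that value, lower it a few layers at a time; a long context window needs a larger KV-cache reserve than the guess above.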

Did this fix it?

If your case was different, email hello@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.