Which quantization should I use for coding agents (Aider / Cline / Continue)?
The answer
One paragraph. No hedging beyond what the data actually warrants.
Q6_K is the sweet spot. Q4_K_M is acceptable for autocomplete-only. Q8_0 is overkill.
Coding agents (Aider, Cline, Continue) make 8-12 LLM calls per edit. The model has to:
- Read the code (1-2 calls)
- Plan the change (1-2 calls)
- Generate the edit in a specific format (3-5 calls)
- Verify / retry on apply failures (2-3 calls)
Small per-token quality errors compound across chains. A 0.5% PPL hit is invisible in chat. The same 0.5% across an 8-step chain shows up as edit-format errors, missing context, and "I can't find that file" hallucinations.
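Back-of-envelope, not a measurement: treat each call in the chain as an independent pass/fail step and the compounding is obvious. The per-step rates below are made up to show the shape of the effect, not measured values for any particular quant.

```python
# Illustrative arithmetic only -- per-step rates are invented to show how
# small per-step losses compound over an agent chain.
def chain_success(per_step: float, steps: int) -> float:
    """Probability the whole chain succeeds if each step succeeds independently."""
    return per_step ** steps

for per_step in (0.995, 0.98, 0.95):
    print(f"per-step {per_step:.3f}: "
          f"8-step chain {chain_success(per_step, 8):.1%}, "
          f"12-step chain {chain_success(per_step, 12):.1%}")

# per-step 0.995: 8-step chain 96.1%, 12-step chain 94.2%
# per-step 0.980: 8-step chain 85.1%, 12-step chain 78.5%
# per-step 0.950: 8-step chain 66.3%, 12-step chain 54.0%
```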
The honest VRAM matrix on Qwen 2.5 Coder 32B (the workhorse model):
| Quant | VRAM (32B weights + 8K KV cache) | Editorial verdict |
|---|---|---|
| Q8_0 | ~36 GB | Quality-overkill for agent chains; rarely worth the VRAM premium over Q6_K. |
| Q6_K | ~28 GB | Recommended sweet spot. Best quality-to-VRAM trade. |
| Q5_K_M | ~24 GB | Fits 24GB cards. Practical daily-driver pick. |
| Q4_K_M | ~20 GB | Fits 16-20GB cards. Real-world agent quality drops measurably here per community reports — but exact apply-success rates vary by agent and task. |
| Q3_K_M | ~16 GB | Below the floor for serious agent loops. Switch to a smaller model at a higher quant. |
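Want to sanity-check the table against a different context length or card? The arithmetic is simple: weights (params × bits-per-weight) plus KV cache (2 × layers × KV heads × head dim × context × 2 bytes for f16). The sketch below uses approximate k-quant bits-per-weight averages and Qwen 2.5 Coder 32B's published architecture numbers, so expect it to land within a few GB of the table rather than match it exactly.

```python
# Rough VRAM estimator: weights + KV cache. Bits-per-weight values are approximate
# k-quant averages; PARAMS and the architecture numbers are Qwen 2.5 Coder 32B's
# published config. Treat the output as a sanity check, not a guarantee.
GB = 1e9

def weight_bytes(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bytes_per_elem: int = 2) -> float:
    # K and V, per layer, per KV head, per head dim, per token (f16 cache by default)
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem

PARAMS = 32.5e9                            # total parameters (approx.)
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128    # Qwen 2.5 Coder 32B architecture
BPW = {"Q8_0": 8.5, "Q6_K": 6.56, "Q5_K_M": 5.69, "Q4_K_M": 4.85, "Q3_K_M": 3.91}

kv = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context=8192)
for quant, bpw in BPW.items():
    total_gb = (weight_bytes(PARAMS, bpw) + kv) / GB
    print(f"{quant:7} ~{total_gb:.0f} GB + a few GB of compute buffers")
```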
Specific per-quant edit-apply success rates are workload-dependent and we don't have measurements that generalize. Community reports broadly agree on the ordering above. If you need numbers for your specific use case, run an A/B on your own corpus.
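The harness for that A/B doesn't need to be elaborate. A minimal sketch, assuming two llama.cpp `llama-server` instances (one per quant) on ports 8080/8081 exposing the OpenAI-compatible chat endpoint, and using "the reply contains a SEARCH/REPLACE block whose SEARCH text actually appears in the file" as a crude stand-in for edit-apply success. The task-file layout, the system prompt, and the success check are illustrative assumptions, not anything Aider itself ships.

```python
# Minimal A/B sketch: same prompts against two quants of the same model, counting
# a crude proxy for edit-apply success. Endpoints, file layout, and the success
# check are assumptions for illustration -- adapt them to your own corpus/agent.
import json
import re
from pathlib import Path

import requests

ENDPOINTS = {
    "Q6_K":   "http://localhost:8080/v1/chat/completions",
    "Q4_K_M": "http://localhost:8081/v1/chat/completions",
}
# Each task: a source file plus an edit instruction (your own corpus goes here).
TASKS = json.loads(Path("tasks.json").read_text())  # [{"file": ..., "instruction": ...}, ...]

BLOCK = re.compile(r"<<<<<<< SEARCH\n(.*?)\n=======\n.*?\n>>>>>>> REPLACE", re.S)

def edit_applies(reply: str, source: str) -> bool:
    """Crude proxy: at least one SEARCH/REPLACE block whose SEARCH text is in the file."""
    return any(m.group(1) in source for m in BLOCK.finditer(reply))

for quant, url in ENDPOINTS.items():
    ok = 0
    for task in TASKS:
        source = Path(task["file"]).read_text()
        resp = requests.post(url, timeout=600, json={
            "messages": [
                {"role": "system", "content": "Reply with SEARCH/REPLACE edit blocks only."},
                {"role": "user", "content": f"{task['instruction']}\n\n{source}"},
            ],
            "temperature": 0,
        })
        reply = resp.json()["choices"][0]["message"]["content"]
        ok += edit_applies(reply, source)
    print(f"{quant}: {ok}/{len(TASKS)} edits applied cleanly (by the crude proxy)")
```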
Decision rule: if you have 24GB VRAM, run Q5_K_M with full context. If you have 32GB+, run Q6_K. Below 24GB, drop to a smaller model (Qwen 2.5 Coder 14B at Q4_K_M is more reliable than 32B at Q3).
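The same rule as a tiny helper, if you want it in a setup script (thresholds and model names mirror the paragraph above; purely a convenience sketch):

```python
def pick_quant(vram_gb: float) -> str:
    # Mirrors the decision rule above; assumes you want full/8K context headroom.
    if vram_gb >= 32:
        return "Qwen 2.5 Coder 32B @ Q6_K"
    if vram_gb >= 24:
        return "Qwen 2.5 Coder 32B @ Q5_K_M"
    return "Qwen 2.5 Coder 14B @ Q4_K_M"  # a smaller model beats 32B @ Q3
```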
What about autocomplete-only? Q4_K_M is fine. Autocomplete is a single-step generation with no edit-format requirement. The error doesn't compound — the human accepts or rejects each suggestion.
Explore the numbers for your specific stack
Where we got the numbers
Edit-apply success rates from community Aider / Cline benchmarks plus my own runs against Qwen 2.5 Coder 32B at each quant (May 2026). PPL deltas from the llama.cpp k-quant PR thread (#1684).
Also see
The same question for general chat — different answer.
Aider, Cline, Continue, Tabby, Twinny — and the model pairing for each.
The current workhorse coding model. Editorial verdict + how-to-run.
All 5 agents that work with local models.
Other questions in this thread
Other /q/ landings on the same topic — same editorial discipline.
Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.