Which quantization should I use for coding agents (Aider / Cline / Continue)?
The answer
One paragraph. No hedging beyond what the data actually warrants.
Q6_K is the sweet spot. Q4_K_M is acceptable for autocomplete-only. Q8_0 is overkill.
Coding agents (Aider, Cline, Continue) make 8-12 LLM calls per edit. The model has to:
- Read the code (1-2 calls)
- Plan the change (1-2 calls)
- Generate the edit in a specific format (3-5 calls)
- Verify / retry on apply failures (2-3 calls)
Small per-token quality errors compound across chains. A 0.5% PPL hit is invisible in chat. The same 0.5% across an 8-step chain shows up as edit-format errors, missing context, and "I can't find that file" hallucinations.
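Back-of-envelope, not a measurement: treat each call in the chain as an independent pass/fail step and the compounding is obvious. The per-step rates below are made up to show the shape of the effect, not measured values for any particular quant.

```python
# Illustrative arithmetic only -- per-step rates are invented to show how
# small per-step losses compound over an agent chain.
def chain_success(per_step: float, steps: int) -> float:
    """Probability the whole chain succeeds if each step succeeds independently."""
    return per_step ** steps

for per_step in (0.995, 0.98, 0.95):
    print(f"per-step {per_step:.3f}: "
          f"8-step chain {chain_success(per_step, 8):.1%}, "
          f"12-step chain {chain_success(per_step, 12):.1%}")

# per-step 0.995: 8-step chain 96.1%, 12-step chain 94.2%
# per-step 0.980: 8-step chain 85.1%, 12-step chain 78.5%
# per-step 0.950: 8-step chain 66.3%, 12-step chain 54.0%
```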
The honest VRAM matrix on Qwen 2.5 Coder 32B (the workhorse model):
| Quant | VRAM (32B weights + 8K KV cache) | Editorial verdict |
|---|---|---|
| Q8_0 | ~36 GB | Quality-overkill for agent chains; rarely worth the VRAM premium over Q6_K. |
| Q6_K | ~28 GB | Recommended sweet spot. Best quality-to-VRAM trade. |
| Q5_K_M | ~24 GB | Fits 24GB cards. Practical daily-driver pick. |
| Q4_K_M | ~20 GB | Fits 16-20GB cards. Real-world agent quality drops measurably here per community reports — but exact apply-success rates vary by agent and task. |
| Q3_K_M | ~16 GB | Below the floor for serious agent loops. Switch to a smaller model at a higher quant. |
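Want to sanity-check the table against a different context length or card? The arithmetic is simple: weights (params × bits-per-weight) plus KV cache (2 × layers × KV heads × head dim × context × 2 bytes for f16). The sketch below uses approximate k-quant bits-per-weight averages and Qwen 2.5 Coder 32B's published architecture numbers, so expect it to land within a few GB of the table rather than match it exactly.

```python
# Rough VRAM estimator: weights + KV cache. Bits-per-weight values are approximate
# k-quant averages; PARAMS and the architecture numbers are Qwen 2.5 Coder 32B's
# published config. Treat the output as a sanity check, not a guarantee.
GB = 1e9

def weight_bytes(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context: int, bytes_per_elem: int = 2) -> float:
    # K and V, per layer, per KV head, per head dim, per token (f16 cache by default)
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem

PARAMS = 32.5e9                            # total parameters (approx.)
LAYERS, KV_HEADS, HEAD_DIM = 64, 8, 128    # Qwen 2.5 Coder 32B architecture
BPW = {"Q8_0": 8.5, "Q6_K": 6.56, "Q5_K_M": 5.69, "Q4_K_M": 4.85, "Q3_K_M": 3.91}

kv = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context=8192)
for quant, bpw in BPW.items():
    total_gb = (weight_bytes(PARAMS, bpw) + kv) / GB
    print(f"{quant:7} ~{total_gb:.0f} GB + a few GB of compute buffers")
```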
Specific per-quant edit-apply success rates are workload-dependent and we don't have measurements that generalize. Community reports broadly agree on the ordering above. If you need numbers for your specific use case, run an A/B on your own corpus.
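The harness for that A/B doesn't need to be elaborate. A minimal sketch, assuming two llama.cpp `llama-server` instances (one per quant) on ports 8080/8081 exposing the OpenAI-compatible chat endpoint, and using "the reply contains a SEARCH/REPLACE block whose SEARCH text actually appears in the file" as a crude stand-in for edit-apply success. The task-file layout, the system prompt, and the success check are illustrative assumptions, not anything Aider itself ships.

```python
# Minimal A/B sketch: same prompts against two quants of the same model, counting
# a crude proxy for edit-apply success. Endpoints, file layout, and the success
# check are assumptions for illustration -- adapt them to your own corpus/agent.
import json
import re
from pathlib import Path

import requests

ENDPOINTS = {
    "Q6_K":   "http://localhost:8080/v1/chat/completions",
    "Q4_K_M": "http://localhost:8081/v1/chat/completions",
}
# Each task: a source file plus an edit instruction (your own corpus goes here).
TASKS = json.loads(Path("tasks.json").read_text())  # [{"file": ..., "instruction": ...}, ...]

BLOCK = re.compile(r"<<<<<<< SEARCH\n(.*?)\n=======\n.*?\n>>>>>>> REPLACE", re.S)

def edit_applies(reply: str, source: str) -> bool:
    """Crude proxy: at least one SEARCH/REPLACE block whose SEARCH text is in the file."""
    return any(m.group(1) in source for m in BLOCK.finditer(reply))

for quant, url in ENDPOINTS.items():
    ok = 0
    for task in TASKS:
        source = Path(task["file"]).read_text()
        resp = requests.post(url, timeout=600, json={
            "messages": [
                {"role": "system", "content": "Reply with SEARCH/REPLACE edit blocks only."},
                {"role": "user", "content": f"{task['instruction']}\n\n{source}"},
            ],
            "temperature": 0,
        })
        reply = resp.json()["choices"][0]["message"]["content"]
        ok += edit_applies(reply, source)
    print(f"{quant}: {ok}/{len(TASKS)} edits applied cleanly (by the crude proxy)")
```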
Decision rule: if you have 24GB VRAM, run Q5_K_M with full context. If you have 32GB+, run Q6_K. Below 24GB, drop to a smaller model (Qwen 2.5 Coder 14B at Q4_K_M is more reliable than 32B at Q3).
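The same rule as a tiny helper, if you want it in a setup script (thresholds and model names mirror the paragraph above; purely a convenience sketch):

```python
def pick_quant(vram_gb: float) -> str:
    # Mirrors the decision rule above; assumes you want full/8K context headroom.
    if vram_gb >= 32:
        return "Qwen 2.5 Coder 32B @ Q6_K"
    if vram_gb >= 24:
        return "Qwen 2.5 Coder 32B @ Q5_K_M"
    return "Qwen 2.5 Coder 14B @ Q4_K_M"  # a smaller model beats 32B @ Q3
```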
What about autocomplete-only? Q4_K_M is fine. Autocomplete is a single-step generation with no edit-format requirement. The error doesn't compound — the human accepts or rejects each suggestion.
Explore the numbers for your specific stack
Where we got the numbers
Edit-apply success rates from community Aider / Cline benchmarks plus my own runs against Qwen 2.5 Coder 32B at each quant (May 2026). PPL deltas from the llama.cpp k-quant PR thread (#1684).
Also see
The same question for general chat — different answer.
Aider, Cline, Continue, Tabby, Twinny — and the model pairing for each.
The current workhorse coding model. Editorial verdict + how-to-run.
All 5 agents that work with local models.
Other questions in this thread
Other /q/ landings on the same topic — same editorial discipline.
Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.