Qwen 2.5 Coder 32B Instruct
Coding-specialist Qwen 2.5. Beats GPT-4o on HumanEval and matches Sonnet on many code-edit benchmarks. The default local-coding model on 24GB cards.
Positioning
The model to run if you want a Cursor / Copilot replacement on your own hardware. Qwen 2.5 Coder 32B is the headline open-weight coding model — strong fill-in-the-middle, strong repo-scale reasoning, fast enough on a 4090 to keep up with interactive editing.
Strengths
- Fill-in-the-middle is genuinely good — the actual mechanism Cursor and Copilot rely on, not just chat-style code completion.
- Repo-aware reasoning — handles 32K-context code review tasks credibly; instruction-tuned to navigate multi-file context.
- 70–88 tok/s on 4090 Q4 — fast enough for interactive code-as-you-type once integrated with a properly streaming editor plugin.
Limitations
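- Licensing needs a check across the family: the 32B Coder weights are Apache 2.0, but some other Qwen 2.5 sizes ship under the Qwen license with its MAU cap, which is a real concern for SaaS deployments that mix model sizes.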
- Lags closed models on novel-architecture tasks — anything genuinely outside its training distribution still falls back to plausible-but-wrong patterns.
- Repo-context isn't free — feeding a real codebase still requires good RAG or AST-aware chunking; the model alone won't fix bad context selection.
Real-world performance on RTX 4090
- Q4_K_M (19 GB): 70–88 tok/s decode, TTFT ~140 ms
- Q5_K_M (22.6 GB): 58–72 tok/s
- Q8_0 (35 GB): partial offload, 18–25 tok/s
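As a rough illustration of what those numbers mean for interactive editing, the sketch below adds time-to-first-token to steady-state decode time. The 40-token completion length is an assumed example, and the ~140 ms TTFT is reused from the Q4_K_M line only.

```python
def completion_latency_s(new_tokens: int, decode_tok_s: float, ttft_ms: float = 140.0) -> float:
    """Rough wall-clock estimate: time to first token plus steady-state decode."""
    return ttft_ms / 1000.0 + new_tokens / decode_tok_s


# Q4_K_M range quoted above: 70-88 tok/s decode, ~140 ms TTFT.
# 40 tokens is an assumed size for a short inline completion.
for tok_s in (70.0, 88.0):
    latency = completion_latency_s(40, tok_s)
    print(f"{tok_s:.0f} tok/s: {latency:.2f} s for a 40-token completion")
```

Under these assumptions a short completion lands in roughly 0.6 to 0.7 seconds, which is why the Q4_K_M figures read as "interactive" while the partially offloaded Q8_0 figures do not.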
Should you run this locally?
Yes, for any developer with a 24 GB-class card (RTX 3090 / 4090 / 5090) who wants Copilot-class autocomplete without the cloud round-trip. This is the headline win for local AI.
No, for developers comfortable paying $10–20/month for closed services: on novel languages or rare frameworks, GPT-4 / Claude still produce more reliable code.
How it compares
- vs DeepSeek Coder V2 Lite → Qwen 2.5 Coder 32B is meaningfully stronger; DeepSeek Coder V2 Lite (16B) is the right pick under 16 GB VRAM.
- vs Codestral 22B → Qwen 2.5 Coder 32B wins on capability; Codestral has cleaner Mistral license terms.
- vs Qwen 2.5 32B Instruct → Coder is dramatically better at coding; pick Instruct for general chat.
- vs DeepSeek V3 / R1 → V3 and R1 are stronger at hard reasoning but far too large for single-card use.
Run this yourself
ollama pull qwen2.5-coder:32b-instruct-q4_K_M
ollama run qwen2.5-coder:32b-instruct-q4_K_M
Settings: Q4_K_M GGUF, 16384 ctx, full GPU on 4090
Editor integration: Continue.dev or Tabby with Ollama backend
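For scripted use rather than the interactive CLI, a minimal sketch of calling the same model through Ollama's local HTTP API might look like the following. It assumes a default Ollama install listening on port 11434; the helper names `build_generate_request` and `generate` are illustrative, and the temperature default is borrowed from the sampler settings on this page.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumes Ollama's default local port
MODEL_TAG = "qwen2.5-coder:32b-instruct-q4_K_M"     # the tag pulled above


def build_generate_request(prompt: str, num_ctx: int = 16384,
                           temperature: float = 0.2) -> urllib.request.Request:
    """Build (but do not send) a non-streaming generate request."""
    payload = {
        "model": MODEL_TAG,
        "prompt": prompt,
        "stream": False,
        # num_ctx mirrors the 16384-token context from the settings line above.
        "options": {"num_ctx": num_ctx, "temperature": temperature},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the request; needs a running Ollama server with the model pulled."""
    with urllib.request.urlopen(build_generate_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Write a Python function that reverses a string.")` only works once the server is running; building the request separately keeps the payload easy to inspect.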
Why this rating
9.2/10: the strongest open-weight coding model that runs on a single 24 GB GPU. Genuinely competitive with closed coding models (GPT-4, Claude) on most non-frontier tasks. It loses points for lagging closed models on tasks outside its training distribution; licensing is not the issue, since the 32B Coder weights are Apache 2.0.
Featured in these stacks
The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.
- Stack · L3 · Workstation tier · Role: Coding model (the actual brain) · Build a local coding-agent stack (May 2026)
Qwen 2.5 Coder 32B Instruct is the strongest open coding model in the 32B class as of May 2026 — beats DeepSeek Coder V2 Lite on HumanEval+ and SWE-Bench Lite at the same VRAM footprint. AWQ-INT4 fits on a 24GB card with headroom for a 32K context window.
- Stack · L3 · Workstation tier · Role: Coding model (32B class) · Build an RTX 4090 AI workstation stack (May 2026)
Qwen 2.5 Coder 32B AWQ-INT4 is the strongest model that fits 24GB with real context room — beats DeepSeek Coder V2 Lite on coding benchmarks at the same VRAM budget. Reserve 8-10GB of VRAM for KV cache; 32K context is the sweet spot.
- Stack · L3 · Workstation tier · Role: Coding model (single-Mac primary) · Build a Mac-native AI stack (May 2026)
Qwen 2.5 Coder 32B in MLX-4bit quant runs comfortably on a 64GB M3 Max with room for 32K context. Beats DeepSeek Coder V2 Lite on coding benchmarks at the same memory footprint.
- Stack · L3 · Workstation tier · Role: Coding model · Build a fully offline coding stack (May 2026)
Qwen 2.5 Coder 32B AWQ-INT4 fits 24GB with 32K context — strongest open coding model in the 32B class as of May 2026. Apache 2.0 license: usable in any environment without licensing surprises. Pre-stage the AWQ weights locally before egress lockdown.
- Stack · L3 · Workstation tier · Role: 32B coding model that fits with concurrency room · Dual RTX 3090 workstation stack — 70B-class on $1,800 of used GPUs
32B class on dual-3090 leaves significant headroom — fits 8K context AND serves 8-16 concurrent coding-agent loops via vLLM continuous batching. The right pick when the workload is coding-tier rather than chat-tier.
- Stack · L3 · Homelab tier · Role: Coding agent serving for 16+ concurrent users · Quad RTX 3090 workstation stack — the prosumer 100B-class ceiling
32B-class coding model on quad-3090 leaves enormous headroom — vLLM continuous batching serves 16+ concurrent coding-agent loops at ~30 tok/s each. The team-tier coding-serve config.
- Stack · L3 · Homelab tier · Role: Coding model (better fit for this combo) · Mixed RTX 4090 + 3090 workstation — the asymmetric upgrade path
32B-class fits with substantial headroom on the asymmetric pair. The smaller model amplifies the mismatch less than 70B (fewer cross-card transitions per token).
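The 8-10 GB KV-cache reservation quoted for the RTX 4090 stack can be sanity-checked with a short calculation. The architecture defaults below (64 layers, 8 key/value heads, head dimension 128, FP16 cache) are assumptions taken from the published Qwen 2.5 32B configuration as best recalled; verify them against the model's `config.json` before relying on the result.

```python
def kv_cache_bytes(context_tokens: int, layers: int = 64, kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one sequence of the given length.

    Defaults are ASSUMED architecture values for the 32B model with an
    FP16 cache; they are not stated on this page.
    """
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens


gib = kv_cache_bytes(32 * 1024) / 2**30
print(f"32K context: {gib:.1f} GiB of KV cache")
```

With those assumed values a single 32K-token sequence needs 8.0 GiB of cache, which sits at the low end of the 8-10 GB guidance; concurrent sequences multiply that figure.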
Featured in this workflow
Full-system workflows that include this model as part of their service ledger — with the one-line operator note for each.
- Workflow · System · homelab · Role: Coding LLM · Local coding-agent system
SWE-Bench Lite competitive at the 32B size class, Apache 2.0 license, strong tool-calling discipline. AWQ-INT4 fits 24 GB with 32K context.
Execution notes
Operator notes
Qwen 2.5 Coder 32B Instruct is the canonical local coding model in May 2026 for the workstation tier. It's the model that turned /stacks/local-coding-agent from "research demo" into "production-grade autonomous coding agent on a single 4090."
What makes it the operator default:
- AWQ-INT4 fits a 24 GB card at 32K context and still leaves KV-cache headroom for memory-injection patterns (Mem0 + MCP postgres add 2-5K tokens of system prompt).
- Apache 2.0 license — no commercial-use friction.
- Strong tool-calling discipline — emits OAI-shaped tool calls reliably; OpenHands / OpenClaw / Goose / Aider all report low parse-error rates.
- SWE-Bench Lite competitive — within 5 points of the closed-source flagships at the 32B size class.
Deployment notes
The /stacks/local-coding-agent recipe pairs this model with vLLM + RTX 4090 + Mem0 + filesystem/git MCP. That's the configuration that hits ~38 tok/s decode and 60-180 second end-to-end iteration on bugfixes.
For 16 GB VRAM cards, drop to Qwen 2.5 Coder 14B — same family, fits 12 GB with 8K context. The 7B variant exists but trails by ~12 points on SWE-Bench, which is the threshold where autonomous-agent quality drops noticeably.
For team-shared workstation deployments (5+ users), SGLang wins over vLLM because OpenHands and OpenClaw both make 5-15 tool calls per task with a stable system prompt — RadixAttention's prefix cache compounds the wins across the cluster.
Runtime compatibility
- vLLM ✓ excellent. AWQ-INT4 quant supported out of the box; recommended for the production-default path.
- SGLang ✓ excellent. Same quant format; pick over vLLM when prefix-cache hit rate >50%.
- Ollama ✓ good. Q4_K_M GGUF available; loses concurrency benefits vs vLLM but wins on solo-developer setup time.
- MLX-LM ✓ good (MLX-4bit). Apple Silicon path; expect ~30% throughput drop vs RTX 4090 but unified memory means 32K context holds without VRAM contention.
- TensorRT-LLM ✗ partial. Compiles but the recompile-per-config friction kills agent-loop iteration speed. Use only when committed to the NVIDIA stack.
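To judge whether a workload clears the 50% prefix-cache threshold mentioned for SGLang, a back-of-the-envelope estimate may help. The sketch assumes a simple agent loop in which every call resends the previous prompt plus a fixed number of new tokens; the function name and the example numbers (3,000 system-prompt tokens, 500 tokens added per call, 10 calls) are hypothetical.

```python
def prefix_hit_rate(system_tokens: int, tokens_per_turn: int, calls: int) -> float:
    """Fraction of all prompt tokens that repeat the previous call's prompt."""
    total = 0
    reused = 0
    prompt = system_tokens
    previous = 0
    for _ in range(calls):
        total += prompt
        reused += previous          # the previous prompt is a prefix of this one
        previous = prompt
        prompt += tokens_per_turn   # tool output and reply appended each turn
    return reused / total if total else 0.0


# Hypothetical agent task: 3,000-token system prompt, 10 tool calls,
# about 500 new tokens of context per call.
print(f"estimated hit rate: {prefix_hit_rate(3000, 500, 10):.0%}")
```

This is an upper bound under ideal reuse, not a measurement; real hit rates depend on cache eviction and how the harness orders its context.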
Quantization suitability
AWQ-INT4 is the production-recommended quant. Q4_K_M GGUF is the alternative for llama.cpp / Ollama deployments; quality loss vs FP16 is 2% on coding benchmarks. Avoid Q3-class quants — the quality drop on coding tasks is meaningful (6-8% HumanEval+ regression).
For 32K context + 4 concurrent agents on a single card, drop to AWQ-INT3 — fits with extra headroom but expect ~5% additional quality loss.
Best use cases
- Autonomous coding agents — OpenHands / OpenClaw / Goose paired with Mem0 for cross-session memory. The /stacks/local-coding-agent canonical setup.
- Surgical-edit workflows — Aider for tight-control git-integrated edits; same model, different paradigm.
- Multi-file refactoring — strong on SWE-Bench-shape "rename concept across N files" tasks.
- Test-suite stabilization — reasoning depth handles flaky-test investigation well; pairs with MCP-git for commit-history grounding.
When to use a different model
- Reasoning-first workloads (math, multi-step proof, complex algorithm design): use DeepSeek R1 Distill Qwen 32B or QwQ 32B — the explicit reasoning-token emission produces better plans on complex problems even though it costs 2-5x more tokens per query.
- 16 GB VRAM tier: drop to Qwen 2.5 Coder 14B.
- Frontier-tier capability: cluster-deploy DeepSeek V4 or use Anthropic API — the 32B class hits a ceiling on the most complex multi-file architecture decisions.
- Reproducibility-sensitive research: OpenCoder 8B — fully-open training data + recipes.
Failure modes specific to this model
- Token-limit truncation on long generations. Default vLLM `max_tokens` is conservative; coding tasks producing large diffs need 2048+ to avoid truncation mid-edit.
- Tool-call format misfires under high temperature. Keep temperature ≤0.3 for tool-calling workflows; ≤0.6 for chat.
- Reasoning-tag confusion. Qwen 2.5 Coder doesn't emit `<think>` blocks (no reasoning-mode toggle here; that's Qwen 3). If your agent harness expects them, you'll see empty reasoning sections.
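The three failure modes above can be guarded against on the client side. The sketch below builds an OpenAI-style chat payload with an explicit `max_tokens` and a capped temperature, and splits out a reasoning block only when one is present. The helper names are illustrative, and the model string is the Hugging Face source id used as a placeholder; substitute whatever name your server registers.

```python
import re


def build_chat_payload(messages, tools=None, tool_calling=True):
    """Request body for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": "Qwen/Qwen2.5-Coder-32B-Instruct",  # placeholder served-model name
        "messages": messages,
        "max_tokens": 2048,                           # room for large diffs
        "temperature": 0.3 if tool_calling else 0.6,  # caps quoted above
    }
    if tools:
        payload["tools"] = tools
    return payload


_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is empty when no block exists."""
    match = _THINK.search(text)
    if match is None:
        # Expected case for Qwen 2.5 Coder: no reasoning block at all.
        return "", text
    answer = (text[:match.start()] + text[match.end():]).strip()
    return match.group(1).strip(), answer
```

Treating a missing reasoning block as the normal case, rather than as a parse error, lets the same harness serve both this model and reasoning-mode models.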
Going deeper
- /stacks/local-coding-agent — canonical deployment recipe for this model
- /stacks/memory-enabled-agent — memory-enabled variant with Mem0 + Letta
- vLLM operational review — the runtime-specific operator detail
- /systems/agent-execution-systems — the architectural depth on agent loops
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- Best open-weight coder at release
- Apache 2.0
- Strong fill-in-middle
Weaknesses
- Less strong on general chat than non-coder Qwen
Prompting kit
Tested patterns for getting the most out of Qwen 2.5 Coder 32B Instruct locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.
Recommended system prompt
You are Qwen, a senior software engineer created by Alibaba Cloud. Write clean, idiomatic code. Explain your reasoning before the code. When asked for code, prefer correctness over cleverness, and add comments only when they materially help the reader.
Quirks to know
- Coding-specialized: Qwen's release notes claim parity with GPT-4o on HumanEval+ and LiveCodeBench. Per the model card, code quality stays high on Python, JavaScript, TypeScript, Java, C++, Go, Rust, and SQL.
- Up to 128K context per the model card (the shipped config targets 32K; longer inputs rely on YaRN rope scaling). Useful for whole-repo code understanding.
- Per the model card, fill-in-the-middle (FIM) is supported via the <|fim_prefix|>, <|fim_suffix|>, <|fim_middle|> special tokens. Useful for IDE-style autocomplete integrations.
- Uses ChatML chat template (Qwen 2.5 variant, slightly different from Qwen 3 because it has no /think toggle).
- Tool calling supported via Hermes-style format. Per the model card, the Coder variant is particularly reliable for code-execution tool chains compared to the base Qwen 2.5.
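To make the FIM quirk concrete, here is a minimal sketch of assembling a fill-in-the-middle prompt from the three special tokens listed above. The helper names are illustrative; the prompt must be sent to a raw completion endpoint, without the chat template wrapped around it.

```python
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Prefix, then suffix, then the marker after which the model writes the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def fim_from_cursor(source: str, cursor: int) -> str:
    """Split a buffer at the cursor offset, as an editor plugin would."""
    return build_fim_prompt(source[:cursor], source[cursor:])


code = "def add(a, b):\n    return \n"
cursor = code.index("return ") + len("return ")
print(fim_from_cursor(code, cursor))
```

The model's output is the text that belongs at the cursor; the editor plugin is responsible for trimming it at a sensible stop point.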
Chat template
<|im_start|>{role}\n{content}<|im_end|>. No /think marker — that's a Qwen 3 feature.
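A minimal sketch of rendering that template by hand is shown below, mainly to make the token layout visible. It assumes plain text messages only; the full template shipped with the tokenizer also handles tools and a default system prompt, so it should be preferred where available.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in ChatML form."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([
    {"role": "system", "content": "You are Qwen, a senior software engineer."},
    {"role": "user", "content": "Write a binary search in Go."},
]))
```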
Tool calling
Same Hermes-style format as Qwen 3. Strong tool-call reliability per the Qwen 2.5 Coder release notes; suitable for code-execution agents.
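When the runtime returns raw text instead of parsed tool calls, the Hermes-style output can be extracted with a small parser. This sketch assumes each call arrives as a JSON object with `name` and `arguments` keys wrapped in `<tool_call>` tags; the `read_file` tool in the example is hypothetical.

```python
import json
import re

_TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(text):
    """Extract well-formed tool calls; malformed blocks are skipped."""
    calls = []
    for raw in _TOOL_CALL.findall(text):
        try:
            obj = json.loads(raw.strip())
        except json.JSONDecodeError:
            continue  # count these if you track parse-error rates
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"],
                          "arguments": obj.get("arguments", {})})
    return calls


reply = ('I will read the file first.\n<tool_call>\n'
         '{"name": "read_file", "arguments": {"path": "main.py"}}\n'
         '</tool_call>')
print(parse_tool_calls(reply))
```

Skipping malformed blocks rather than raising keeps an agent loop alive, at the cost of needing a separate counter if you want to monitor format misfires.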
Sampler settings
- temperature: 0.2
- top_p: 0.95
For code generation, the model card recommends low temperature (0.1-0.3). For exploratory pseudocode or planning, raise to 0.6.
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 19.0 GB | 24 GB |
| Q8_0 | 34.0 GB | 40 GB |
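The table can be turned into a simple selection rule: pick the highest-quality quantization whose VRAM requirement fits the card. The sketch uses only the two rows above and assumes full GPU residency, so partial offload is not considered; the function name is illustrative.

```python
# (quantization, VRAM required in GB), highest quality first, from the table above.
QUANTS = [("Q8_0", 40.0), ("Q4_K_M", 24.0)]


def pick_quant(vram_gb: float):
    """Return the best listed quantization that fits, or None if none do."""
    for name, required_gb in QUANTS:
        if vram_gb >= required_gb:
            return name
    return None


for card_gb in (16, 24, 48):
    print(f"{card_gb} GB card: {pick_quant(card_gb)}")
```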
Get the model
Ollama
One-line install
ollama run qwen2.5-coder:32b
HuggingFace
Original weights
Source repository — direct quantization required.
Frequently asked
What's the minimum VRAM to run Qwen 2.5 Coder 32B Instruct?
Can I use Qwen 2.5 Coder 32B Instruct commercially?
What's the context length of Qwen 2.5 Coder 32B Instruct?
How do I install Qwen 2.5 Coder 32B Instruct with Ollama?
Source: huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.