qwen

32B parameters

Commercial OK

Reviewed June 2026

Qwen 2.5 Coder 32B Instruct

Coding-specialist Qwen 2.5. Beats GPT-4o on HumanEval and matches Sonnet on many code-edit benchmarks. The default local-coding model on 24GB cards.

License: Apache 2.0·Released Nov 12, 2024·Context: 131,072 tokens

BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026

9.2/10

Positioning

The model to run if you want a Cursor / Copilot replacement on your own hardware. Qwen 2.5 Coder 32B is the headline open-weight coding model — strong fill-in-the-middle, strong repo-scale reasoning, fast enough on a 4090 to keep up with interactive editing.

Strengths

Fill-in-the-middle is genuinely good — the actual mechanism Cursor and Copilot rely on, not just chat-style code completion.
Repo-aware reasoning — handles 32K-context code review tasks credibly; instruction-tuned to navigate multi-file context.
70–88 tok/s on 4090 Q4 — fast enough for interactive code-as-you-type once integrated with a properly streaming editor plugin.

Limitations

Qwen license MAU cap is a real concern for SaaS deployments.
Lags closed models on novel-architecture tasks — anything genuinely outside its training distribution still falls back to plausible-but-wrong patterns.
Repo-context isn't free — feeding a real codebase still requires good RAG or AST-aware chunking; the model alone won't fix bad context selection.

Real-world performance on RTX 4090

Q4_K_M (19 GB): 70–88 tok/s decode, TTFT ~140 ms
Q5_K_M (22.6 GB): 58–72 tok/s
Q8_0 (35 GB): partial offload, 18–25 tok/s

Should you run this locally?

Yes, for any developer with an RTX 3090 / 4090 / 5080+ who wants Copilot-class autocomplete without the cloud round-trip. The headline win for local AI. No, for developers comfortable with closed services for $10–20/month — for novel languages or rare frameworks, GPT-4 / Claude still produce more reliable code.

How it compares

vs DeepSeek Coder V2 Lite → Qwen 2.5 Coder 32B is meaningfully stronger; DeepSeek Coder V2 Lite (16B) is the right pick under 16 GB VRAM.
vs Codestral 22B → Qwen 2.5 Coder 32B wins on capability; Codestral has cleaner Mistral license terms.
vs Qwen 2.5 32B Instruct → Coder is dramatically better at coding; pick Instruct for general chat.
vs DeepSeek V3 / R1 → V3 and R1 are stronger at hard reasoning but uncommonly large for single-card use.

Run this yourself

ollama pull qwen2.5-coder:32b-instruct-q4_K_M
ollama run qwen2.5-coder:32b-instruct-q4_K_M

Settings: Q4_K_M GGUF, 16384 ctx, full GPU on 4090 Editor integration: Continue.dev or Tabby with Ollama backend

›Why this rating

9.2/10 — the strongest open-weight coding model that runs on a single 24 GB GPU. Genuinely competitive with closed coding models (GPT-4, Claude) on most non-frontier tasks. The only reason it loses points is the Qwen license MAU cap.

Overview

Coding-specialist Qwen 2.5. Beats GPT-4o on HumanEval and matches Sonnet on many code-edit benchmarks. The default local-coding model on 24GB cards.

Featured in these stacks

The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.

Stack · L3·Workstation tier·Role: Coding model (the actual brain)
Build a local coding-agent stack (May 2026)
Qwen 2.5 Coder 32B Instruct is the strongest open coding model in the 32B class as of May 2026 — beats DeepSeek Coder V2 Lite on HumanEval+ and SWE-Bench Lite at the same VRAM footprint. AWQ-INT4 fits on a 24GB card with headroom for a 32K context window.
Stack · L3·Workstation tier·Role: Coding model (32B class)
Build an RTX 4090 AI workstation stack (May 2026)
Qwen 2.5 Coder 32B AWQ-INT4 is the strongest model that fits 24GB with real context room — beats DeepSeek Coder V2 Lite on coding benchmarks at the same VRAM budget. Reserve 8-10GB of VRAM for KV cache; 32K context is the sweet spot.
Stack · L3·Workstation tier·Role: Coding model (single-Mac primary)
Build a Mac-native AI stack (May 2026)
Qwen 2.5 Coder 32B in MLX-4bit quant runs comfortably on a 64GB M3 Max with room for 32K context. Beats DeepSeek Coder V2 Lite on coding benchmarks at the same memory footprint.
Stack · L3·Workstation tier·Role: Coding model
Build a fully offline coding stack (May 2026)
Qwen 2.5 Coder 32B AWQ-INT4 fits 24GB with 32K context — strongest open coding model in the 32B class as of May 2026. Apache 2.0 license: usable in any environment without licensing surprises. Pre-stage the AWQ weights locally before egress lockdown.
Stack · L3·Workstation tier·Role: 32B coding model that fits with concurrency room
Dual RTX 3090 workstation stack — 70B-class on $1,800 of used GPUs
32B class on dual-3090 leaves significant headroom — fits 8K context AND serves 8-16 concurrent coding-agent loops via vLLM continuous batching. The right pick when the workload is coding-tier rather than chat-tier.
Stack · L3·Homelab tier·Role: Coding agent serving for 16+ concurrent users
Quad RTX 3090 workstation stack — the prosumer 100B-class ceiling
32B-class coding model on quad-3090 leaves enormous headroom — vLLM continuous batching serves 16+ concurrent coding-agent loops at ~30 tok/s each. The team-tier coding-serve config.
Stack · L3·Homelab tier·Role: Coding model (better fit for this combo)
Mixed RTX 4090 + 3090 workstation — the asymmetric upgrade path
32B-class fits with substantial headroom on the asymmetric pair. The smaller model amplifies the mismatch less than 70B (fewer cross-card transitions per token).

Featured in this workflow

Full-system workflows that include this model as part of their service ledger — with the one-line operator note for each.

Workflow · System·homelab·Role: Coding LLM
Local coding-agent system
SWE-Bench Lite competitive at the 32B size class, Apache 2.0 license, strong tool-calling discipline. AWQ-INT4 fits 24 GB with 32K context.

Execution notes

L1.25 enriched

Operator notes

Qwen 2.5 Coder 32B Instruct is the canonical local coding model in May 2026 for the workstation tier. It's the model that turned /stacks/local-coding-agent from "research demo" into "production-grade autonomous coding agent on a single 4090."

What makes it the operator default:

AWQ-INT4 fits a 24 GB card with 32K context with comfortable KV-cache headroom for memory-injection patterns (Mem0 + MCP postgres add 2-5K tokens of system prompt).
Apache 2.0 license — no commercial-use friction.
Strong tool-calling discipline — emits OAI-shaped tool calls reliably; OpenHands / OpenClaw / Goose / Aider all report low parse-error rates.
SWE-Bench Lite competitive — within 5 points of the closed-source flagships at the 32B size class.

Deployment notes

The /stacks/local-coding-agent recipe pairs this model with vLLM + RTX 4090 + Mem0 + filesystem/git MCP. That's the configuration that hits ~38 tok/s decode and 60-180 second end-to-end iteration on bugfixes.

For 16 GB VRAM cards, drop to Qwen 2.5 Coder 14B — same family, fits 12 GB with 8K context. The 7B variant exists but trails by ~12 points on SWE-Bench, which is the threshold where autonomous-agent quality drops noticeably.

For team-shared workstation deployments (5+ users), SGLang wins over vLLM because OpenHands and OpenClaw both make 5-15 tool calls per task with a stable system prompt — RadixAttention's prefix cache compounds the wins across the cluster.

Runtime compatibility

vLLM ✓ excellent. AWQ-INT4 quant supported out of the box; recommended for the production-default path.
SGLang ✓ excellent. Same quant format; pick over vLLM when prefix-cache hit rate >50%.
Ollama ✓ good. Q4_K_M GGUF available; loses concurrency benefits vs vLLM but wins on solo-developer setup time.
MLX-LM ✓ good (MLX-4bit). Apple Silicon path; expect ~30% throughput drop vs RTX 4090 but unified memory means 32K context holds without VRAM contention.
TensorRT-LLM ✗ partial. Compiles but the recompile-per-config friction kills agent-loop iteration speed. Use only when committed to the NVIDIA stack.

Quantization suitability

AWQ-INT4 is the production-recommended quant. Q4_K_M GGUF is the alternative for llama.cpp / Ollama deployments; quality loss vs FP16 is 2% on coding benchmarks. Avoid Q3-class quants — the quality drop on coding tasks is meaningful (6-8% HumanEval+ regression).

For 32K context + 4 concurrent agents on a single card, drop to AWQ-INT3 — fits with extra headroom but expect ~5% additional quality loss.

Best use cases

Autonomous coding agents — OpenHands / OpenClaw / Goose paired with Mem0 for cross-session memory. The /stacks/local-coding-agent canonical setup.
Surgical-edit workflows — Aider for tight-control git-integrated edits; same model, different paradigm.
Multi-file refactoring — strong on SWE-Bench-shape "rename concept across N files" tasks.
Test-suite stabilization — reasoning depth handles flaky-test investigation well; pairs with MCP-git for commit-history grounding.

When to use a different model

Reasoning-first workloads (math, multi-step proof, complex algorithm design): use DeepSeek R1 Distill Qwen 32B or QwQ 32B — the explicit reasoning-token emission produces better plans on complex problems even though it costs 2-5x more tokens per query.
16 GB VRAM tier: drop to Qwen 2.5 Coder 14B.
Frontier-tier capability: cluster-deploy DeepSeek V4 or use Anthropic API — the 32B class hits a ceiling on the most complex multi-file architecture decisions.
Reproducibility-sensitive research: OpenCoder 8B — fully-open training data + recipes.

Failure modes specific to this model

Token-limit truncation on long generations. Default vLLM `max_tokens` is conservative; coding tasks producing large diffs need 2048+ to avoid truncation mid-edit.
Tool-call format misfires under high temperature. Keep temperature ≤0.3 for tool-calling workflows; ≤0.6 for chat.
Reasoning-tag confusion. Qwen 2.5 Coder doesn't emit <think> blocks (no reasoning-mode toggle here — that's Qwen 3). If your agent harness expects them, you'll see empty reasoning sections.

Going deeper

/stacks/local-coding-agent — canonical deployment recipe for this model
/stacks/memory-enabled-agent — memory-enabled variant with Mem0 + Letta
vLLM operational review — the runtime-specific operator detail
/systems/agent-execution-systems — the architectural depth on agent loops

Reviewed May 6, 2026 by Fredoline Eruo

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Family siblings (qwen-2.5-coder)

Qwen 2.5 Coder 1.5B1.5B

Edge

Qwen 2.5 Coder 3B3B

Edge

Qwen 2.5 Coder 7B Instruct7B

Consumer

Qwen 2.5 Coder 14B Instruct14B

Consumer

Qwen 2.5 Coder 32B Instruct32B

You are here

Distilled / fine-tuned from this

Qwen 2.5 Coder 7B Instruct7B

Consumer

Qwen 2.5 Coder 14B Instruct14B

Consumer

Strengths

Best open-weight coder at release
Apache 2.0
Strong fill-in-middle

Weaknesses

Less strong on general chat than non-coder Qwen

Prompting kit

From model card

source

Tested patterns for getting the most out of Qwen 2.5 Coder 32B Instruct locally. Local models are pickier about prompt structure than cloud models — what works on Claude or GPT-5 often fails here.

Recommended system prompt

You are Qwen, a senior software engineer created by Alibaba Cloud. Write clean, idiomatic code. Explain your reasoning before the code. When asked for code, prefer correctness over cleverness, and add comments only when they materially help the reader.

Quirks to know

•Coding-specialized — Qwen's release notes claim parity with GPT-4o on HumanEval+ and LiveCodeBench. Per the model card, code quality stays high on Python, JavaScript, TypeScript, Java, C++, Go, Rust, and SQL.
•Native 128K context per the model card. Ideal for whole-repo code understanding.
•Per the model card, fill-in-the-middle (FIM) is supported via the <|fim_prefix|>, <|fim_suffix|>, <|fim_middle|> special tokens. Useful for IDE-style autocomplete integrations.
•Uses ChatML chat template (Qwen 2.5 variant — slightly different from Qwen 3 because no /think toggle).
•Tool calling supported via Hermes-style format. Per the model card, the Coder variant is particularly reliable for code-execution tool chains compared to the base Qwen 2.5.

Chat template

ChatML (Qwen 2.5 variant)

<|im_start|>{role}\n{content}<|im_end|>. No /think marker — that's a Qwen 3 feature.

Tool calling

✓ Supported(hermes-style)

Same Hermes-style format as Qwen 3. Strong tool-call reliability per the Qwen 2.5 Coder release notes; suitable for code-execution agents.

Sampler settings

temperature: 0.2
top_p: 0.95

For code generation, the model card recommends low temperature (0.1-0.3). For exploratory pseudocode or planning, raise to 0.6.

Browse prompting kits for every model →/prompting

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization	File size	VRAM required
Q4_K_M	19.0 GB	24 GB
Q8_0	34.0 GB	40 GB

Get the model

Ollama

One-line install

ollama run qwen2.5-coder:32bRead our Ollama review →

HuggingFace

Original weights

huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct

Source repository — direct quantization required.

Hardware that runs this

Cards with enough VRAM for at least one quantization of Qwen 2.5 Coder 32B Instruct.

NVIDIA B300 (Blackwell Ultra)

Compare alternatives

Models worth comparing

Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.

Same tier

Models in the same parameter band as this one

Step up

More capable — bigger memory footprint

Step down

Smaller — faster, runs on weaker hardware

Frequently asked

What's the minimum VRAM to run Qwen 2.5 Coder 32B Instruct?

24GB of VRAM is enough to run Qwen 2.5 Coder 32B Instruct at the Q4_K_M quantization (file size 19.0 GB). Higher-quality quantizations need more.

Can I use Qwen 2.5 Coder 32B Instruct commercially?

Yes — Qwen 2.5 Coder 32B Instruct ships under the Apache 2.0, which permits commercial use. Always read the license text before deployment.

What's the context length of Qwen 2.5 Coder 32B Instruct?

Qwen 2.5 Coder 32B Instruct supports a context window of 131,072 tokens (about 131K).

How do I install Qwen 2.5 Coder 32B Instruct with Ollama?

Run `ollama pull qwen2.5-coder:32b` to download, then `ollama run qwen2.5-coder:32b` to start a chat session. The default quantization is Q4_K_M.

Compare against other models

Curated head-to-head decisions where Qwen 2.5 Coder 32B Instruct is one of the contenders. For arbitrary pairings use /model-battle.

Qwen 2.5 Coder 32B vs DeepSeek R1 Distill Qwen 32B

which 32B for local coding?

Qwen 2.5 Coder 32B vs Qwen 3 32B

should you switch to the new generation?

Source: huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.

Related — keep moving

Compare hardware

Buyer guides

When it doesn't work

Recommended hardware

Alternatives

Qwen 2.5 Coder 7B Instruct Qwen 2.5 Coder 1.5B Qwen 2.5 Coder 3B Qwen 2.5 Coder 14B Instruct

Before you buy

Verify Qwen 2.5 Coder 32B Instruct runs on your specific hardware before committing money.

Will it run on my hardware? →Custom hardware comparison →GPU recommender (4 questions) →