Qwen 3 Coder 32B
Coding-specialized fine-tune of Qwen 3 32B. Curated coding corpus; outperforms Qwen 2.5 Coder 32B on SWE-Bench by ~6 points. Apache 2.0.
Positioning
Qwen 3 Coder 32B is a coding-specialized dense model from Alibaba, released under the permissive Apache 2.0 license. With 32 billion parameters and a 128K context window, it is designed for complex software engineering and agentic coding tasks. As a fine-tune of Qwen 3 32B on a curated coding corpus, it represents a targeted improvement over its predecessor, Qwen 2.5 Coder 32B, particularly in benchmark scenarios like SWE-Bench. Its dense architecture means inference cost scales linearly with parameter count, making it a straightforward choice for workstation-class deployments.
Strengths
- Permissive Apache 2.0 license: Enables unrestricted commercial use, modification, and redistribution, making it ideal for enterprise coding pipelines.
- 128K context window: Accommodates large codebases, multi-file projects, and lengthy agent reasoning traces without truncation.
- Coding-specialized fine-tuning: Curated training data targets software engineering tasks, offering a focused alternative to general-purpose models.
- Dense architecture simplicity: Unlike MoE models, dense 32B has predictable memory and compute requirements, easing deployment planning.
Limitations
- High memory floor: At FP16, the model requires 64 GB of disk space, and even at Q4_K_M (18 GB) the KV cache for 128K context can add 30–50% overhead, pushing total VRAM needs beyond typical consumer GPUs.
- No MoE efficiency: All 32B parameters are active per token, so inference cost is higher than an equivalently sized MoE model with a smaller active parameter count.
- Narrow specialization: While excellent for coding, it may underperform on general knowledge or creative tasks compared to similarly sized general-purpose models.
- Limited community validation: We do not have independent measurements for this model; vendor-reported benchmark gains (e.g., ~6 points on SWE-Bench) should be treated as best-case until replicated.
What it takes to run this locally
Quantized sizes range from 64 GB (FP16) down to ~10.4 GB (Q2_K). For practical use with a 128K context, add 30–50% for KV cache and framework overhead. A Q4_K_M quant (18 GB + overhead) fits comfortably on a single 24 GB GPU (e.g., RTX 4090, RTX 6000 Ada). Q3_K_M (15.6 GB) or Q2_K (10.4 GB) may run on 16 GB GPUs with reduced context length. Dual GPU setups (e.g., two 24 GB cards) can handle FP16 or Q8_0 with full context. This model is firmly in the workstation deployment class; consumer GPUs with ≤12 GB VRAM are not recommended.
Should you run this locally?
Yes if you need a permissively licensed, coding-specialized model for commercial agent workflows and have a workstation-class GPU (≥24 GB VRAM) to run Q4_K_M or higher quants with adequate context. No if you lack the hardware for 128K context overhead, or if your use case requires general-purpose reasoning beyond code generation.
Catalog cross-links
- Qwen 3 32B
- Qwen 2.5 Coder 32B
- Workstation deployment guide
Overview
Coding-specialized fine-tune of Qwen 3 32B. Curated coding corpus; outperforms Qwen 2.5 Coder 32B on SWE-Bench by ~6 points. Apache 2.0.
How to run it
Qwen 3 Coder 32B is Alibaba's code-specialized 32B dense model — the coding-focused member of the Qwen 3 family. Run at Q4_K_M via Ollama (ollama pull qwen3-coder:32b) or llama.cpp with -ngl 999 -fa -c 16384. Q4_K_M file size ~18 GB on disk. Minimum VRAM: 16 GB — RTX 4080 (16GB) at Q4_K_M with KV offload. RTX 4090 24GB: Q4_K_M comfortably at 16K context. Recommended: RTX 4090 24GB at Q4_K_M. Throughput: ~35-55 tok/s on RTX 4090 at Q4_K_M. Qwen 3 architecture — broad support. Coder is specialized for code generation, debugging, code review, and technical explanation. Supports FIM (fill-in-the-middle) for IDE code completion. Strong on: Python, TypeScript, Java, Go, Rust, C++. Less strong on: general chat, creative writing — use base Qwen 3 32B instead. Context: Qwen 3's 128K (practical 16-32K on 24 GB for code). Code generation typically uses shorter contexts (2-8K) — KV cache is less of a pressure. For larger code models: DeepSeek Coder V2 236B. For smaller: Qwen 3 Coder 7B.
Hardware guidance
Minimum: RTX 3060 12GB at Q3_K_M with KV offload. Recommended: RTX 4090 24GB at Q4_K_M (16K context). Optimal: RTX 5090 32GB at Q4_K_M (32K+ context). VRAM math: 32B dense, Q4_K_M ≈ 18 GB. KV cache at 16K: ~8 GB. Total: ~26 GB. RTX 4090 24GB: Q4 + 8-12K context on-GPU. Code contexts are typically 2-8K — efficient. RTX 3090 24GB: same. RTX 4080 16GB: Q4 + 2K on-GPU. MacBook Pro M4 Pro 24GB+: Q4 at 10-20 tok/s. Cloud: A10 24GB at Q4_K_M. AWQ-INT4 drops to ~16 GB. For IDE integration (FIM), budget extra context for surrounding code + prefix/suffix. Tab completion bursts benefit from high single-batch throughput — RTX 4090 is ideal.
What breaks first
- FIM support varies by runtime. Fill-in-the-middle requires FIM-aware inference stacks. Ollama may not expose FIM. Continue.dev + llama.cpp with FIM is the standard path. 2. Code quality at Q3. Syntax-level errors increase at Q3 — hallucinated function names, wrong parameter types, broken imports. Use Q4_K_M minimum for production code generation. 3. API hallucination. Like all code models, Qwen 3 Coder hallucinates APIs — especially for less common libraries. Pair with RAG on current API docs. 4. Chat template for coder vs chat. Qwen 3 Coder uses a FIM-aware chat template that differs from Qwen 3 Chat. Using the wrong template breaks code completion formatting.
Runtime recommendation
Continue.dev + llama.cpp FIM for IDE code completion. Ollama for chat-based code help. vLLM for serving. Qwen 3 architecture — well-supported. For FIM: ensure your llama.cpp build has FIM enabled and use a FIM-aware frontend.
Common beginner mistakes
Mistake: Using Qwen 3 Coder for general chat. Fix: Code specialization degrades general conversational quality. Use base Qwen 3 32B for non-code tasks. Mistake: Expecting FIM to work in Ollama. Fix: Ollama's chat interface doesn't expose FIM formatting. Use llama.cpp directly with a FIM-aware client like Continue.dev. Mistake: Using default Qwen 3 chat template for coder model. Fix: Qwen 3 Coder has a FIM-specific template. Check the hf repo for the correct format. Mistake: Trusting generated code without testing. Fix: Qwen 3 Coder generates plausible-looking code that may have subtle bugs. Always test generated code, especially for security-sensitive or production systems.
Family & lineage
How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.
Strengths
- Strongest open coding model in 32B class as of late 2025
- Reasoning toggle for complex bugs
- Apache 2.0
Weaknesses
- AWQ-INT4 fits 24GB tightly with reasoning blocks
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| AWQ-INT4 | 19.0 GB | 22 GB |
Get the model
HuggingFace
Original weights
Source repository — direct quantization required.
Hardware that runs this
Cards with enough VRAM for at least one quantization of Qwen 3 Coder 32B.
Models worth comparing
Same parameter band, plus what's one tier above and below — so you can decide what actually fits your hardware.
Frequently asked
What's the minimum VRAM to run Qwen 3 Coder 32B?
Can I use Qwen 3 Coder 32B commercially?
What's the context length of Qwen 3 Coder 32B?
Source: huggingface.co/Qwen/Qwen3-Coder-32B
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
Related — keep moving
Verify Qwen 3 Coder 32B runs on your specific hardware before committing money.