DeepSeek Coder V2 236B

Full DeepSeek Coder V2. 236B total / 21B active MoE coder.

License: DeepSeek License · Released Jun 17, 2024 · Context: 131,072 tokens

Our verdict

OP · Fredoline Eruo | Verified Jun 12, 2026

unrated

Positioning

DeepSeek Coder V2 236B is a Mixture-of-Experts (MoE) code generation model from DeepSeek, released under the DeepSeek License. With 236B total parameters but only ~21B activated per token, it offers the representational capacity of a large dense model while keeping inference compute closer to a dense 21B model. Its 131K context window is among the widest available for open-weight code models, making it suited for tasks involving long codebases or multi-file reasoning. This model is firmly in the datacenter deployment class due to its memory footprint.

Strengths

Massive total capacity with efficient inference: The MoE architecture activates only ~21B of 236B parameters per token, meaning per-token compute is comparable to a dense ~21B model while retaining the knowledge and reasoning of a much larger model.
Extremely long context window: 131,072 tokens of context length enables processing entire code repositories, long documentation, or complex multi-turn conversations without truncation.
Purpose-built for code: As a dedicated coder model, it is trained specifically for code understanding and generation tasks, making it a strong candidate for software development workflows.
Permissive commercial license: The DeepSeek License allows commercial use, making it suitable for integration into proprietary products and services.

Limitations

Datacenter-only deployment: Even at Q2_K quantization (~76.7 GB), the model needs roughly 100-115 GB once KV cache and framework overhead are added, which means multiple high-end GPUs (e.g., 2× A100 80GB or 4× RTX 6000 Ada 48GB) and substantial system memory. It is not feasible on a single consumer GPU.
No community-verified benchmarks: As of this writing, independent benchmark results (e.g., HumanEval, MMLU) are not yet widely reported. Operators should treat vendor-published metrics as best-case and verify on their own workloads.
MoE complexity: MoE models can be more challenging to deploy and optimize than dense models, requiring careful load balancing and kernel support in inference frameworks.
License restrictions: While permissive, the DeepSeek License may have specific terms (e.g., attribution requirements) that users should review before commercial deployment.

What it takes to run this locally

At FP16, the model requires 472 GB of storage and roughly 500+ GB of GPU memory for inference. Quantization reduces the footprint significantly: Q8_0 (251 GB), Q6_K (194.7 GB), Q5_K_M (168.2 GB), Q4_K_M (132.8 GB), Q3_K_M (115.0 GB), and Q2_K (~76.7 GB). Add 30–50% overhead for KV cache and framework memory at typical context lengths. This places the model in the datacenter deployment class: multi-GPU setups are required at every quantization, from about 2× A100 80GB at Q2_K up to 4–8× A100 80GB or H100 at higher precision.
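The overhead arithmetic above can be sketched in a few lines. This is a rough estimate only: it uses the file sizes quoted in the text, applies the 30–50% overhead range, and assumes memory divides evenly across 80 GB GPUs (real deployments lose some capacity to per-GPU fragmentation). The function names are illustrative, not part of any tool.

```python
import math

# Quantized file sizes in GB, as quoted in the text above.
QUANT_SIZES_GB = {
    "Q8_0": 251.0,
    "Q6_K": 194.7,
    "Q5_K_M": 168.2,
    "Q4_K_M": 132.8,
    "Q3_K_M": 115.0,
    "Q2_K": 76.7,
}


def memory_range_gb(weights_gb, overhead=(0.30, 0.50)):
    """Return (low, high) total memory after adding 30-50% overhead."""
    low, high = overhead
    return weights_gb * (1 + low), weights_gb * (1 + high)


def gpus_needed(total_gb, gpu_gb=80.0):
    """Smallest count of equal-sized GPUs whose combined memory covers total_gb.

    Assumption: memory pools perfectly across GPUs, which is optimistic.
    """
    return math.ceil(total_gb / gpu_gb)


def footprint_table(gpu_gb=80.0):
    """One row per quantization: (name, low_gb, high_gb, min_gpus, max_gpus)."""
    rows = []
    for name, size in QUANT_SIZES_GB.items():
        low, high = memory_range_gb(size)
        rows.append((name, round(low, 1), round(high, 1),
                     gpus_needed(low, gpu_gb), gpus_needed(high, gpu_gb)))
    return rows


if __name__ == "__main__":
    for name, low, high, g_low, g_high in footprint_table():
        print(f"{name:7s} {low:6.1f}-{high:6.1f} GB  -> {g_low}-{g_high} x 80 GB GPUs")
```

Passing a different gpu_gb (for example 48.0) gives the equivalent count for 48 GB cards.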

Should you run this locally?

Yes if: you have access to a multi-GPU datacenter cluster (4+ high-memory GPUs), need a state-of-the-art code model with a very long context window, and can tolerate the operational complexity of MoE deployment. The permissive license makes it attractive for commercial code tooling.

No if: you lack multi-GPU infrastructure, need real-time interactive speeds on consumer hardware, or prefer a simpler dense model for easier deployment. For single-GPU setups, consider smaller dense code models or smaller MoE models with lower total parameters.

Catalog cross-links

DeepSeek Coder V2 Lite – smaller MoE variant for more accessible deployment
DeepSeek V2 – general-purpose MoE model from the same family
DeepSeek License – details on usage terms

How to run it

DeepSeek Coder V2 236B is DeepSeek's large code-specialized MoE model: 236B total parameters with ~21B active per token (DeepSeek-V2-style routing: 2 shared experts plus 6 of 160 routed experts per MoE layer). Run at Q4_K_M via llama.cpp with -ngl 999 -fa -c 16384. The Q4_K_M file is ~135 GB on disk (all experts included).
Minimum VRAM: 160 GB, i.e. 2× A100 80GB; 4× RTX A6000 (48GB each, 192 GB total) with row-split also works.
Expert offload: keeping only the most frequently used experts in VRAM (about 42B parameters' worth, roughly twice the ~21B active per token) cuts weight VRAM to ~35-40 GB, at the cost of added latency whenever routing selects a RAM-resident expert.
Recommended: 4× A100 80GB at AWQ-INT4 with all experts in VRAM.
Throughput: ~10-20 tok/s on 4× A100 at Q4_K_M (8K context). The MoE architecture keeps per-token compute low; with ~21B active parameters, generation speed feels like a ~30B dense model.
Runtime support: the DeepSeek MoE architecture is well supported in llama.cpp and vLLM (DeepSeek MoE kernels).
Context: 128K advertised; practical at Q4 on 4× A100 is 8-16K.
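As a minimal sketch, the llama.cpp invocation described above can be assembled as an argument list. The flags are the ones quoted in the text, plus -m for the model file and --split-mode row for row-split. The binary name llama-server and the model filename are assumptions for illustration; some llama.cpp builds expect an explicit value after -fa, so check --help for your build before relying on this.

```python
import shlex


def llama_cpp_command(model_path, context=16384, binary="llama-server"):
    """Build an argv list from the flags quoted in the text.

    model_path and binary are placeholders; adjust to your installation.
    """
    return [
        binary,
        "-m", model_path,        # path to the GGUF file
        "-ngl", "999",           # offload as many layers as possible to GPU
        "-fa",                   # flash attention (some builds want a value here)
        "-c", str(context),      # context size in tokens
        "--split-mode", "row",   # split tensors by row across GPUs
    ]


if __name__ == "__main__":
    # Hypothetical filename, shown only to make the example concrete.
    argv = llama_cpp_command("models/deepseek-coder-v2-236b-q4_k_m.gguf")
    print(shlex.join(argv))
```

Building the command as a list rather than a string makes it safe to pass to subprocess.run without shell quoting problems.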

Hardware guidance

Minimum: 2× A100 80GB at Q4_K_M with expert offload.
Recommended: 4× A100 80GB at AWQ-INT4 (all experts in VRAM).
Budget: 4× RTX A6000 (192 GB total) at Q4_K_M with row-split.
VRAM math: 236B total MoE parameters at Q4 ≈ 135 GB of weights. KV cache at 8K: ~10-15 GB. Total with all experts in VRAM: ~150 GB at 8K. Expert offload reduces VRAM to ~35-40 GB for active experts, with inactive experts resident in system RAM (adds routing latency).
2× A100 80GB = 160 GB: tight.
4× RTX A6000 = 192 GB: comfortable.
4× RTX 4090 = 96 GB: insufficient for Q4 without aggressive expert offload to RAM.
Mac Studio M2 Ultra 192 GB: Q4_K_M at 2-5 tok/s with expert offload.
Cloud: 2-4× A100 at $10-30/hr.
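The fit verdicts above follow from simple subtraction, sketched below. The weight and KV figures come from the text; the 20 GB cutoff between "tight" and "comfortable" is an assumption chosen for illustration, not a measured threshold.

```python
WEIGHTS_Q4_GB = 135.0   # Q4 weights, from the text
KV_8K_GB = 15.0         # upper end of the 10-15 GB KV estimate at 8K

# Total VRAM per configuration (GPU count x per-GPU memory).
CONFIGS_GB = {
    "2x A100 80GB": 2 * 80,
    "4x RTX A6000 48GB": 4 * 48,
    "4x RTX 4090 24GB": 4 * 24,
    "4x A100 80GB": 4 * 80,
}


def headroom_gb(total_vram_gb, weights_gb=WEIGHTS_Q4_GB, kv_gb=KV_8K_GB):
    """VRAM left after weights and KV cache, with all experts in VRAM."""
    return total_vram_gb - weights_gb - kv_gb


def classify(total_vram_gb, tight_below_gb=20.0):
    """Label a configuration; tight_below_gb is an illustrative cutoff."""
    spare = headroom_gb(total_vram_gb)
    if spare < 0:
        return "insufficient"
    if spare < tight_below_gb:
        return "tight"
    return "comfortable"


if __name__ == "__main__":
    for name, total in CONFIGS_GB.items():
        print(f"{name:20s} {total:4d} GB  headroom {headroom_gb(total):6.1f} GB  {classify(total)}")
```

This ignores expert offload, which trades the missing VRAM for routing latency.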

What breaks first

1. DeepSeek MoE architecture. DeepSeek Coder V2 uses a custom MoE with shared experts + routed experts. Standard Mixtral MoE kernels in vLLM/llama.cpp may not handle the shared expert component correctly. Verify DeepSeek-specific MoE support.
2. Expert offload latency. With experts in system RAM, routing decisions hitting RAM-resident experts cause 50-150ms stalls. For code generation (which often has long, linear outputs), frequent stall patterns are disruptive.
3. Code quality at Q4. Code generation is precision-sensitive. Q4_K_M may produce measurably worse code (more bugs, worse structure) than Q8 or FP8. Benchmark your specific coding tasks.
4. DeepSeek's tokenizer. DeepSeek Coder uses a code-optimized tokenizer. Natural language prompts may be less token-efficient than specialized code prompts.
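To make the shared + routed distinction concrete, here is a toy sketch of one MoE layer operating on a single number instead of a tensor. It is not DeepSeek's implementation: the experts are stand-in functions, the router logits are supplied by hand, and the gating details (plain softmax weights, no renormalization) are simplifying assumptions. The point is only that shared experts run for every token, so a kernel that handles routed experts alone computes a different result.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, shared_experts, routed_experts, router_logits, top_k):
    """Toy MoE layer: shared experts always run, routed experts are top-k gated.

    Returns (output, sorted indices of the routed experts that were selected).
    """
    # Shared experts contribute for every token, regardless of routing.
    out = sum(expert(x) for expert in shared_experts)

    # Routed experts: pick the top_k by router probability and weight them.
    probs = softmax(router_logits)
    ranked = sorted(range(len(routed_experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    out += sum(probs[i] * routed_experts[i](x) for i in chosen)
    return out, sorted(chosen)


if __name__ == "__main__":
    shared = [lambda v: 2.0 * v]
    routed = [lambda v: v + 1.0, lambda v: 10.0 * v, lambda v: -v]
    logits = [0.0, 2.0, 1.0]

    with_shared, picked = moe_forward(1.0, shared, routed, logits, top_k=2)
    routed_only, _ = moe_forward(1.0, [], routed, logits, top_k=2)
    print("selected routed experts:", picked)
    print("shared expert contribution:", with_shared - routed_only)
```

Comparing the two outputs isolates exactly what a routed-only kernel would drop.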

Runtime recommendation

vLLM with DeepSeek MoE support for production serving. llama.cpp with -ngl 999 for local multi-GPU use. DeepSeek Coder V2 is well-supported in both. SGLang if vLLM MoE routing is unstable. Avoid Ollama for multi-GPU — raw llama.cpp gives better tensor-split control.

Common beginner mistakes

Mistake: Assuming 236B needs 236 GB VRAM. Fix: Q4_K_M is ~135 GB. With expert offload, the VRAM requirement drops to ~40 GB (active experts only). Do the math with your quantization.
Mistake: Using standard Mixtral GGUF conversion for DeepSeek. Fix: DeepSeek has shared + routed experts. Use DeepSeek-specific GGUF conversion scripts.
Mistake: Setting context to 128K on minimum hardware. Fix: KV cache at 128K is 80-100+ GB. 2× A100 at 160 GB with 135 GB of weights leaves about 25 GB for KV cache and runtime buffers. That is enough for somewhere between ~13K and ~40K tokens depending on which KV estimate on this page you use, and well short of 128K. Scale context based on available VRAM after weights.
Mistake: Expecting chat-quality code explanations. Fix: DeepSeek Coder V2 is code-specialized and generates code, not conversational explanations. Use instruct variants for chatty code help.
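The "scale context to the VRAM left after weights" advice can be sketched as a small helper. It assumes KV cache grows linearly with context length and takes the per-8K cost as an input, since the estimates on this page vary; the function name and the reserve_gb parameter are illustrative.

```python
def max_context_tokens(total_vram_gb, weights_gb, kv_gb_per_8k, reserve_gb=0.0):
    """Estimate how many context tokens fit in the VRAM left after weights.

    Assumes KV cache scales linearly with context length. reserve_gb is an
    optional allowance for runtime buffers (an assumption, default 0).
    """
    free_gb = total_vram_gb - weights_gb - reserve_gb
    if free_gb <= 0:
        return 0
    return int(free_gb / kv_gb_per_8k * 8192)


if __name__ == "__main__":
    # 2x A100 80GB with ~135 GB of Q4 weights, using the 10-15 GB per 8K range.
    for kv in (10.0, 15.0):
        tokens = max_context_tokens(160, 135, kv)
        print(f"KV {kv:4.1f} GB per 8K -> about {tokens} tokens")
```

The result depends heavily on which KV estimate is plugged in, so treat any single number as a rough ceiling and measure on your own hardware.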

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Family siblings (deepseek-coder)

DeepSeek Coder V2 Lite (16B) · 16B · Consumer
DeepSeek Coder V3 · 33B · Workstation
DeepSeek Coder V2 236B · 236B · You are here