DeepSeek V4 Pro (1.6T MoE)
DeepSeek's April 2026 frontier flagship. 1.6T total / 49B active MoE with hybrid Compressed Sparse Attention + Heavily Compressed Attention. 1M context window. Closes most of the gap with Claude Opus 4.6 on coding while keeping MIT license + 27% of V3.2's per-token FLOPs.
Positioning
DeepSeek V4 Pro is a 1.6T-parameter Mixture-of-Experts model with ~49B active parameters per token, the open-weight frontier of late-2025 / early-2026. For most local-AI operators it sits in a category called "I read about it, I don't run it locally." The model's job, in our editorial view, is two things: (1) set the upper-bound reference for what open-weight reasoning + coding can do, and (2) push the local-AI hardware ecosystem to make 192-GB-class workstations affordable enough to be operator-grade. The interesting question for our readers isn't "is V4 Pro good?" (yes, demonstrably) but "do you actually need it locally, or are you better off renting an API and saving the hardware budget?"
Strengths
- Genuine frontier-tier reasoning + coding. V4 Pro is competitive with closed-source frontier models on HumanEval, GSM8K, MMLU-Pro, and SWE-bench Verified — the operator-grade benchmarks that actually predict daily-driver utility.
- MoE efficiency. ~49B active parameters per token means inference math is closer to a 49B dense model than a 1.6T dense one. Decode is fast where the memory fits.
- Permissive license — open weights, commercial use allowed (verify the DeepSeek license for your specific use case, but the constraints are mild compared to Llama 4 Maverick's terms).
- Reasonable serving footprint at low quants. ~210 GB at Q3 (the realistic homelab tier), ~140 GB at Q2 (functional with quality loss). 192-GB unified-memory consumer hardware (Mac Studio M3 Ultra) genuinely runs this — the only consumer-tier path that does.
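The MoE efficiency point can be made concrete with a little arithmetic. The sketch below uses the 1.6T total / 49B active figures from the model summary and the common rule of thumb of roughly 2 FLOPs per active parameter per token. That rule is an assumption here: it ignores attention cost and is not a DeepSeek-published number.

```python
# Rough MoE arithmetic: compute scales with active parameters,
# but memory must still hold every expert.
TOTAL_PARAMS = 1.6e12   # total parameters, from the model summary
ACTIVE_PARAMS = 49e9    # parameters active per token, from the model summary


def active_fraction(active: float, total: float) -> float:
    """Share of the weights that participate in computing one token."""
    return active / total


def approx_flops_per_token(active: float) -> float:
    """Rule-of-thumb forward cost: about 2 FLOPs per active parameter (assumption)."""
    return 2.0 * active


frac = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"active share of weights: {frac:.2%}")
print(f"approx compute per token: {approx_flops_per_token(ACTIVE_PARAMS):.2e} FLOPs")
print(f"parameters that must stay resident: {TOTAL_PARAMS:.1e}")
```

Only about 3% of the weights do work on any given token, which is why decode speed behaves like a mid-size dense model while the memory bill stays that of a 1.6T one.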
Limitations
- Memory is the wall. Q4 (280 GB) doesn't fit any consumer hardware. Q3 (210 GB) needs 192-GB unified memory or workstation cards. Q2 (140 GB) fits a 128-GB Mac Studio with offload. FP16 (3.2 TB) is datacenter-only forever.
- Tok/s drops fast at low quants. Q3 on Mac Studio M3 Ultra: 5-10 tok/s. Q2: slightly faster. This is "batch work tolerable, interactive chat painful" territory.
- Quality at Q2 is meaningfully worse than Q4. Frontier models lose more from aggressive quantization than smaller models do. Don't run this at Q1.
- No 24-GB-card path. RTX 5090 at 32 GB is far short of what V4 Pro needs at any usable quant.
- Tooling lag. New MoE architectures take days-to-weeks for vLLM, SGLang, and llama.cpp to optimize fully. Day-zero performance lags peak performance by 2-4×.
Real-world performance on Mac Studio M3 Ultra (192 GB)
- Q3 (~210 GB, partial offload to swap): 5-9 tok/s decode, TTFT in the low seconds for 1K prompts. Functional for batch work, painful for interactive chat.
- Q2 (~140 GB, fits in memory): 8-12 tok/s decode, faster TTFT, noticeable quality regression vs Q3.
- Compare with: rented H100 80GB ×8 datacenter setup runs FP8 V4 Pro at ~60-100 tok/s — that's the actual production-grade serving target, not consumer hardware.
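To see what those decode rates mean in wall-clock terms, here is a small sketch that converts tok/s into time per response. The 1,000-token response length is a hypothetical workload, and the calculation ignores TTFT.

```python
# Convert the decode rates quoted above into seconds per response.
def seconds_to_generate(num_tokens: int, tok_per_s: float) -> float:
    """Decode time only; prompt processing (TTFT) is not included."""
    return num_tokens / tok_per_s


RESPONSE_TOKENS = 1000  # hypothetical response length

# (slowest, fastest) tok/s ranges taken from the list above
decode_ranges = {
    "Mac Studio Q3": (5, 9),
    "Mac Studio Q2": (8, 12),
    "8x H100 FP8": (60, 100),
}

for name, (slow_rate, fast_rate) in decode_ranges.items():
    worst = seconds_to_generate(RESPONSE_TOKENS, slow_rate)
    best = seconds_to_generate(RESPONSE_TOKENS, fast_rate)
    print(f"{name}: {best:.0f}-{worst:.0f} s per {RESPONSE_TOKENS}-token response")
```

At Q3 a single long answer takes roughly two to three minutes, which is the practical meaning of "painful for interactive chat."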
Should you run this locally?
Yes, if you have a 192-GB Mac Studio (or equivalent workstation), you're privacy-locked enough that the DeepSeek API (hosted in China) isn't acceptable, you accept Q2-Q3 quality + 5-10 tok/s, AND your workload tolerates batch latency rather than demanding interactive chat. Operator-grade niche, not mainstream.
No, for anyone running a single consumer GPU. If your use case is "I want frontier reasoning today," rent the API (DeepSeek's hosted API, or wait for OpenRouter availability) at $0.14-0.27/M input tokens. The hosted-API cost-per-token is dramatically lower than the amortized hardware cost-per-token for sub-1k-msg/day operators.
Probably not, for anyone who can run DeepSeek R1 (the smaller 671B reasoning model) or Qwen 3 235B-A22B instead. These hit similar reasoning quality at meaningfully lower hardware requirements.
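The API-versus-hardware argument above is easy to sanity-check. The sketch below prices a month of input tokens at the quoted $0.14-0.27/M rate. The message size, the 30-day month, and especially the hardware price are hypothetical placeholders, not quoted figures, and output-token pricing (not given above) is left out.

```python
# Back-of-envelope: monthly API spend vs. a one-time hardware purchase.
def monthly_input_cost_usd(msgs_per_day: int,
                           input_tokens_per_msg: int,
                           usd_per_million_tokens: float,
                           days: int = 30) -> float:
    """Input-token spend for one month at a flat per-million-token price."""
    tokens = msgs_per_day * input_tokens_per_msg * days
    return tokens / 1_000_000 * usd_per_million_tokens


def break_even_months(hardware_cost_usd: float, monthly_api_cost_usd: float) -> float:
    """Months of API spend needed to equal the hardware purchase price."""
    return hardware_cost_usd / monthly_api_cost_usd


MSGS_PER_DAY = 1000             # upper end of the sub-1k-msg/day operator
INPUT_TOKENS_PER_MSG = 2000     # hypothetical prompt size
HARDWARE_COST_USD = 10_000.0    # hypothetical placeholder, not a quoted price

low = monthly_input_cost_usd(MSGS_PER_DAY, INPUT_TOKENS_PER_MSG, 0.14)
high = monthly_input_cost_usd(MSGS_PER_DAY, INPUT_TOKENS_PER_MSG, 0.27)
print(f"monthly input cost: ${low:.2f}-${high:.2f}")
print(f"break-even at the high price: {break_even_months(HARDWARE_COST_USD, high):.0f} months")
```

Under these assumptions the API bill is in the tens of dollars per month, so the hardware never pays for itself on cost alone; privacy or latency has to carry the decision.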
How it compares
- vs DeepSeek R1 (671B reasoning) → R1 is the prior-generation reasoning specialist. V4 Pro is broader (better non-reasoning tasks) at much higher hardware cost. Pick R1 if reasoning is your only goal; V4 Pro if you also want coding + general capability at frontier-tier.
- vs Qwen 3 235B-A22B (Qwen frontier) → Qwen 3 235B is more accessible (~140 GB at Q4 vs V4 Pro's 280 GB at Q4). Quality is comparable on most benchmarks but Qwen's multilingual edge is meaningful for non-English work. Pick Qwen for accessibility + multilingual; V4 Pro for absolute coding ceiling.
- vs Llama 4 Maverick → Maverick is Meta's frontier MoE response. License terms are stricter (700M MAU clause + use restrictions). Quality is in the same ballpark; Maverick has stronger ecosystem support (vLLM tensor-parallel landed earlier) but the license is the operative constraint for many teams.
- vs DeepSeek V4 Flash (284B MoE, smaller sibling) → V4 Flash is the consumer-tier accessible variant. Runs on a Mac Studio M3 Ultra at usable speeds in Q4. Same reasoning DNA as V4 Pro at a fraction of the memory cost. For 95% of operators, V4 Flash is the right choice and V4 Pro is academic.
Run this yourself (if you really must)
# Mac Studio M3 Ultra 192 GB — Q2 fits, Q3 with offload
ollama pull deepseek-v4-pro:q2_K
ollama run deepseek-v4-pro:q2_K
# Or via llama.cpp directly (more control over offload):
llama-server -m deepseek-v4-pro-Q3_K_M.gguf \
--ctx-size 8192 -ngl 999 --no-mmap
Quant: Q3_K_M GGUF
Context: 8192 (KV cache f16, ~16 GB additional)
Backend: llama.cpp Metal via Ollama
Hardware: Mac Studio M3 Ultra 192 GB unified memory
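The configuration above implies a simple memory budget: quantized weights plus KV cache must fit in unified memory. A minimal sketch using the figures quoted in this review (~210 GB at Q3, ~140 GB at Q2, ~16 GB of KV cache at 8192 context, 192 GB of memory); it ignores what macOS and other processes need, so real headroom is smaller.

```python
# Does weights + KV cache fit in unified memory?
def headroom_gb(weights_gb: float, kv_cache_gb: float, memory_gb: float) -> float:
    """Positive: spare memory. Negative: amount that spills to swap or offload."""
    return memory_gb - (weights_gb + kv_cache_gb)


def fits(weights_gb: float, kv_cache_gb: float, memory_gb: float) -> bool:
    return headroom_gb(weights_gb, kv_cache_gb, memory_gb) >= 0


MEMORY_GB = 192     # Mac Studio M3 Ultra unified memory
KV_CACHE_GB = 16    # f16 KV cache at 8192 context, as listed above

for quant, weights_gb in {"Q2": 140, "Q3": 210}.items():
    spare = headroom_gb(weights_gb, KV_CACHE_GB, MEMORY_GB)
    verdict = "fits" if fits(weights_gb, KV_CACHE_GB, MEMORY_GB) else "needs offload"
    print(f"{quant}: {verdict} ({spare:+.0f} GB)")
```

With these numbers Q2 leaves about 36 GB spare, while Q3 overshoots by about 34 GB and has to offload.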
Featured in this stack
The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.
- Stack · L3 · Production tier · Role: Frontier coder + reasoner (MIT license) · 4× H100 SXM tensor-parallel workstation (frontier MoE serving reference)
DeepSeek V4 Pro at FP8 or AWQ-INT4 on 4× H100. The open-weight coding ceiling in 2026. MIT license unblocks deployments that Qwen license blocks.
Execution notes
Operator notes
DeepSeek V4 Pro is the open-weight ceiling for coding and reasoning in May 2026. It's the model that sets the bar that other open-weight flagships are measured against.
What makes it the operator default at the frontier tier:
- MIT license — no commercial-use friction, unlike Qwen / Llama / Gemma equivalents.
- Coding leader — strongest open-weight on SWE-Bench Verified and HumanEval+ as of May 2026.
- Multi-token prediction — the MTP head delivers ~1.8× decode throughput vs equivalent-size single-token models.
- Tool-calling discipline — the RL post-training stage was specifically tuned for agent harnesses.
Deployment notes
DeepSeek V4 Pro is firmly in the cluster-only deployment tier for self-hosting. AWQ-INT4 fits on:
- 8× H100 80GB (640 GB) — the production reference.
- 4× H200 141GB (564 GB) — slightly tighter; viable.
- Apple Mac Studio M3 Ultra cluster (Exo) — research-only; quality preserved but throughput is impractical.
Most operators access this via API. Self-hosting only makes sense for orgs with dedicated coding-agent deployments at scale. The /stacks/local-coding-agent canonical setup (Qwen 2.5 Coder 32B on a single 4090) covers 90% of operators; V4 Pro is for the 10% that need the absolute capability ceiling.
For sub-frontier hardware running the same family lineage:
- Workstation tier: DeepSeek R1 Distill Qwen 32B preserves the R1 reasoning lineage on a single 4090.
- Datacenter tier (without cluster): DeepSeek R1 Distill Llama 70B on dual-A100.
Runtime compatibility
- vLLM ✓ excellent. MTP head supported as of vLLM 0.7+; tensor-parallel-size 8 is the H100 reference deployment.
- SGLang ✓ excellent. RadixAttention prefix-cache + agent-loop is the highest-throughput configuration.
- TensorRT-LLM ✓ best-in-class for throughput at scale. Recompile-per-config friction is real.
- Ollama / llama.cpp ✗ impractical at this size. Single-machine GGUF was not designed for this tier.
- MLX-LM ✓ partial via Exo cluster. Research-grade only.
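As a sketch of the vLLM reference deployment named above, the helper below assembles a `vllm serve` command with `--tensor-parallel-size 8`. The model ID is taken from this page's source link. Whether that repository loads in your vLLM version, and whether you need the optional `--quantization` value, are assumptions to check against your install. The script only builds and prints the command; it does not launch anything.

```python
# Assemble (but do not run) a vLLM launch command for the 8-GPU reference setup.
import shlex
from typing import List, Optional


def build_vllm_command(model: str,
                       tensor_parallel_size: int,
                       quantization: Optional[str] = None) -> List[str]:
    """Return the argument list for `vllm serve` with tensor parallelism."""
    cmd = ["vllm", "serve", model,
           "--tensor-parallel-size", str(tensor_parallel_size)]
    if quantization is not None:
        # Example: "awq" for an AWQ checkpoint (assumes such a checkpoint exists)
        cmd += ["--quantization", quantization]
    return cmd


cmd = build_vllm_command("deepseek-ai/DeepSeek-V4-Pro", tensor_parallel_size=8)
print(shlex.join(cmd))
```

Building the argument list in code keeps the tensor-parallel size and quantization choice in one reviewable place before anything is launched on rented GPUs.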
Quantization suitability
AWQ-INT4 is the operational sweet spot. INT8 fits on 16× H100 but the quality lift over INT4 is sub-1% on most benchmarks — rarely justifies the 2× hardware cost.
The MTP head needs special quant handling. Some pipelines silently drop it during conversion, killing the throughput advantage. Verify your runtime preserves it before committing.
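One way to verify that the MTP head survived conversion is to measure decode throughput with multi-token prediction enabled and disabled, then compare the ratio with the ~1.8× figure quoted earlier. The sketch below assumes your runtime lets you toggle MTP. The 25% tolerance and the sample timings are hypothetical.

```python
# Compare measured decode throughput with and without MTP enabled.
def decode_tok_per_s(tokens_generated: int, elapsed_s: float) -> float:
    """Throughput from a timed generation run."""
    return tokens_generated / elapsed_s


def mtp_speedup_preserved(tps_with_mtp: float,
                          tps_without_mtp: float,
                          expected_speedup: float = 1.8,
                          tolerance: float = 0.25) -> bool:
    """True if the measured speedup is within `tolerance` of the expected one."""
    measured = tps_with_mtp / tps_without_mtp
    return measured >= expected_speedup * (1.0 - tolerance)


# Hypothetical timings from two runs of the same prompt set
baseline = decode_tok_per_s(tokens_generated=5000, elapsed_s=100.0)      # MTP off
healthy = decode_tok_per_s(tokens_generated=5000, elapsed_s=56.0)        # MTP on
suspicious = decode_tok_per_s(tokens_generated=5000, elapsed_s=97.0)     # MTP "on"

print("healthy quant keeps speedup:", mtp_speedup_preserved(healthy, baseline))
print("suspicious quant keeps speedup:", mtp_speedup_preserved(suspicious, baseline))
```

A ratio close to 1.0 with MTP nominally enabled is the signature of a dropped head.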
Best use cases
- Frontier-tier coding agents — pair with vLLM tensor-parallel + filesystem/git MCP. The MTP head + tool-calling discipline + MIT license combination is unique.
- Math + scientific reasoning at scale — leader on AIME / GPQA among open-weight.
- Production agent serving for organizations — MIT license unblocks deployments that Qwen license blocks.
When to use a different model
- Single-card coding (RTX 4090 / 5090 / 6000 Ada): Qwen 2.5 Coder 32B is the operator default. V4 Pro is overkill.
- Workstation reasoning: DeepSeek R1 Distill Qwen 32B — same lineage, single-card.
- Multilingual focus: Qwen 3.5 235B-A17B — stronger non-English coverage.
- Higher decode throughput, modest quality drop: DeepSeek V4 Flash is the throughput-tuned sibling.
Failure modes specific to this model
- MTP head silently dropped during quantization. Some AWQ pipelines lose the MTP head; you get a working model that's missing its ~1.8× decode advantage. Test throughput post-quant.
- Tool-call format strictness. V4 Pro is more strict about JSON-shape than V3 — agent harnesses that rely on lenient parsing may regress.
- Cluster cost. 8× H100 hours is real money. Most operators should default to Qwen 2.5 Coder 32B or hosted-API access.
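For the tool-call strictness issue above, a harness can validate shapes explicitly instead of relying on lenient parsing, so mismatches surface as clear errors. This is a minimal sketch: the two-key `name` / `arguments` shape is a hypothetical schema for illustration, not DeepSeek's documented format.

```python
# Strict tool-call parsing: reject anything that deviates from the expected shape.
import json


def parse_tool_call(raw: str) -> dict:
    """Parse a tool call, raising ValueError on invalid JSON or a wrong shape."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("tool call must be a JSON object")
    if set(call) != {"name", "arguments"}:  # hypothetical required keys
        raise ValueError(f"unexpected keys: {sorted(call)}")
    if not isinstance(call["name"], str):
        raise ValueError("'name' must be a string")
    if not isinstance(call["arguments"], dict):
        raise ValueError("'arguments' must be an object")
    return call


good = parse_tool_call('{"name": "read_file", "arguments": {"path": "README.md"}}')
print("parsed tool:", good["name"])

try:
    parse_tool_call("{'name': 'read_file'}")  # single quotes are not valid JSON
except ValueError as err:
    print("rejected:", err)
```

Running the same validator against V3 and V4 Pro outputs during migration shows which harness assumptions depended on lenient behavior.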
Going deeper
- /stacks/local-coding-agent — agent-loop deployment recipe
- /maps/inference-runtimes-2026 — runtime ecosystem map
- vLLM operational review — production-recommended runtime
- DeepSeek R1 Distill Qwen 32B — workstation-tier sibling
Strengths
- Strongest open coder of 2026 — closes in on Claude Opus 4.6
- 1M token context window with CSA+HCA attention
- 27% per-token FLOPs vs V3.2; 10% KV cache
- MIT license — fully open weights
Weaknesses
- 1.6T total params — workstation cluster or cloud GPU only
- Q4_K_M still ~920 GB on disk
- Local deployment is research-tier only
Quantization variants
Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.
| Quantization | File size | VRAM required |
|---|---|---|
| Q4_K_M | 920.0 GB | 1024 GB |
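The table row can be cross-checked against the parameter count. The sketch below derives the bits per weight implied by a 920 GB file for 1.6T parameters, plus the gap between file size and the listed VRAM requirement. It assumes decimal gigabytes (10^9 bytes) and that essentially the whole file is weights.

```python
# Implied bits per weight and VRAM headroom for the Q4_K_M row above.
TOTAL_PARAMS = 1.6e12     # total parameters
FILE_SIZE_GB = 920.0      # Q4_K_M file size from the table
VRAM_REQUIRED_GB = 1024.0 # VRAM requirement from the table


def implied_bits_per_weight(file_size_gb: float, params: float) -> float:
    """Average bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    return file_size_gb * 1e9 * 8 / params


def runtime_overhead_gb(vram_gb: float, file_size_gb: float) -> float:
    """VRAM beyond the weights, left for KV cache, activations, and buffers."""
    return vram_gb - file_size_gb


bpw = implied_bits_per_weight(FILE_SIZE_GB, TOTAL_PARAMS)
overhead = runtime_overhead_gb(VRAM_REQUIRED_GB, FILE_SIZE_GB)
print(f"implied bits per weight: {bpw:.2f}")
print(f"VRAM beyond weights: {overhead:.0f} GB")
```

The same formula run in reverse gives a quick size estimate for any other quantization once you know its average bits per weight.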
Get the model
HuggingFace
Original weights
Source repository — direct quantization required.
Source: huggingface.co/deepseek-ai/DeepSeek-V4-Pro
Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.