deepseek

1600B parameters

Commercial OK

Reviewed June 2026

DeepSeek V4 Pro (1.6T MoE)

DeepSeek's April 2026 frontier flagship. 1.6T total / 49B active MoE with hybrid Compressed Sparse Attention + Heavily Compressed Attention. 1M context window. Closes most of the gap with Claude Opus 4.6 on coding while keeping MIT license + 27% of V3.2's per-token FLOPs.

License: MIT·Released Apr 24, 2026·Context: 1,048,576 tokens

BLK · VERDICT

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026

unrated

Positioning

DeepSeek V4 Pro is a 1.6T-parameter Mixture-of-Experts model with ~37B active parameters per token — the open-weight frontier of late-2025 / early-2026. For most local-AI operators it sits in a category called "I read about it, I don't run it locally." The model's job, in our editorial view, is two things: (1) set the upper-bound reference for what open-weight reasoning + coding can do, and (2) push the local-AI hardware ecosystem to make 192-GB-class workstations affordable enough to be operator-grade. The interesting question for our readers isn't "is V4 Pro good?" — yes, demonstrably — but "do you actually need it locally, or are you better off renting an API and saving the hardware budget?"

Strengths

Genuine frontier-tier reasoning + coding. V4 Pro is competitive with closed-source frontier models on HumanEval, GSM8K, MMLU-Pro, and SWE-bench Verified — the operator-grade benchmarks that actually predict daily-driver utility.
MoE efficiency. ~37B active parameters per token mean inference math is closer to a 37B dense model than a 1.6T dense one. Decode is fast where the memory fits.
Permissive license — open weights, commercial use allowed (verify the DeepSeek license for your specific use case, but the constraints are mild compared to Llama 4 Maverick's terms).
Reasonable serving footprint at low quants. ~210 GB at Q3 (the realistic homelab tier), ~140 GB at Q2 (functional with quality loss). 192-GB unified-memory consumer hardware (Mac Studio M3 Ultra) genuinely runs this — the only consumer-tier path that does.

Limitations

Memory is the wall. Q4 (280 GB) doesn't fit any consumer hardware. Q3 (210 GB) needs 192-GB unified memory or workstation cards. Q2 (140 GB) fits a 128-GB Mac Studio with offload. FP16 (3.2 TB) is datacenter-only forever.
Tok/s drops fast at low quants. Q3 on Mac Studio M3 Ultra: 5-~10 tok/s. Q2: slightly faster. This is "batch work tolerable, interactive chat painful" territory.
Quality at Q2 is meaningfully worse than Q4. Frontier models lose more from aggressive quantization than smaller models do. Don't run this at Q1.
No 24-GB-card path. RTX 5090 at 32 GB is far short of what V4 Pro needs at any usable quant.
Tooling lag. New MoE architectures take days-to-weeks for vLLM, SGLang, and llama.cpp to optimize fully. Day-zero performance lags peak performance by 2-4×.

Real-world performance on Mac Studio M3 Ultra (192 GB)

Q3 (~210 GB partial-offload to swap): 5-9 tok/s decode, TTFT in the low seconds for 1K prompts. Functional for batch work, painful for interactive chat.
Q2 (~140 GB fits): 8-~12 tok/s decode, faster TTFT, noticeable quality regression vs Q3.
Compare with: rented H100 80GB ×8 datacenter setup runs FP8 V4 Pro at ~60-100 tok/s — that's the actual production-grade serving target, not consumer hardware.

Should you run this locally?

Yes, if you have a 192-GB Mac Studio (or equivalent workstation), you're privacy-locked enough that DeepSeek API (hosted in China) isn't acceptable, you accept Q2-Q3 quality + 5-~10 tok/s, AND your workload tolerates batch latency rather than demanding interactive chat. Operator-grade niche, not mainstream.

No, for anyone running a single consumer GPU. Anyone whose use case is "I want frontier reasoning today" — rent the API (DeepSeek's hosted API, or wait for OpenRouter availability) at $0.14-0.27/M input tokens. The hosted-API cost-per-token is dramatically lower than the amortized hardware cost-per-token for sub-1k-msg/day operators.

Probably not, for anyone who can run DeepSeek R1 (the smaller 671B reasoning model) or Qwen 3 235B-A22B instead. These hit similar reasoning quality at meaningfully lower hardware requirements.

How it compares

vs DeepSeek R1 (671B reasoning) → R1 is the prior-generation reasoning specialist. V4 Pro is broader (better non-reasoning tasks) at much higher hardware cost. Pick R1 if reasoning is your only goal; V4 Pro if you also want coding + general capability at frontier-tier.
vs Qwen 3 235B-A22B (Qwen frontier) → Qwen 3 235B is more accessible (~140 GB at Q4 vs V4 Pro's 280 GB at Q4). Quality is comparable on most benchmarks but Qwen's multilingual edge is meaningful for non-English work. Pick Qwen for accessibility + multilingual; V4 Pro for absolute coding ceiling.
vs Llama 4 Maverick → Maverick is Meta's frontier MoE response. License terms are stricter (700M MAU clause + use restrictions). Quality is in the same ballpark; Maverick has stronger ecosystem support (vLLM tensor-parallel landed earlier) but the license is the operative constraint for many teams.
vs DeepSeek V4 Flash (284B MoE, smaller sibling) → V4 Flash is the consumer-tier accessible variant. Runs on a Mac Studio M3 Ultra at usable speeds in Q4. Same reasoning DNA as V4 Pro at a fraction of the memory cost. For 95% of operators, V4 Flash is the right choice and V4 Pro is academic.

Run this yourself (if you really must)

# Mac Studio M3 Ultra 192 GB — Q2 fits, Q3 with offload
ollama pull deepseek-v4-pro:q2_K
ollama run deepseek-v4-pro:q2_K

# Or via llama.cpp directly (more control over offload):
llama-server -m deepseek-v4-pro-Q3_K_M.gguf \
  --ctx-size 8192 -ngl 999 --no-mmap

Quant: Q3_K_M GGUF Context: 8192 (KV cache f16, ~16 GB additional) Backend: llama.cpp Metal via Ollama Hardware: Mac Studio M3 Ultra 192 GB unified memory

Overview

Featured in this stack

The L3 execution stacks that pick this model as a recommended component, with the one-line note explaining the role it plays in each.

Stack · L3·Production tier·Role: Frontier coder + reasoner (MIT license)
4× H100 SXM tensor-parallel workstation — frontier MoE serving reference
DeepSeek V4 Pro at FP8 or AWQ-INT4 on 4× H100. The open-weight coding ceiling in 2026. MIT license unblocks deployments that Qwen license blocks.

Execution notes

L1.25 enriched

Operator notes

DeepSeek V4 Pro is the open-weight ceiling for coding and reasoning in May 2026. It's the model that sets the bar that other open-weight flagships are measured against.

What makes it the operator default at the frontier tier:

MIT license — no commercial-use friction, unlike Qwen / Llama / Gemma equivalents.
Coding leader — strongest open-weight on SWE-Bench Verified and HumanEval+ as of May 2026.
Multi-token prediction — the MTP head delivers ~1.8× decode throughput vs equivalent-size single-token models.
Tool-calling discipline — the RL post-training stage was specifically tuned for agent harnesses.

Deployment notes

DeepSeek V4 Pro is firmly in the cluster-only deployment tier for self-hosting. AWQ-INT4 fits on:

8× H100 80GB (640 GB) — the production reference.
4× H200 141GB (564 GB) — slightly tighter; viable.
Apple Mac Studio M3 Ultra cluster (Exo) — research-only; quality preserved but throughput is impractical.

Most operators access this via API. Self-hosted only makes sense for orgs with dedicated coding-agent deployments at scale. The /stacks/local-coding-agent canonical setup Qwen 2.5 Coder 32B on a single 4090 covers 90% of operators; V4 Pro is for the 10% that need the absolute capability ceiling.

For sub-frontier hardware running the same family lineage:

Workstation tier: DeepSeek R1 Distill Qwen 32B preserves the R1 reasoning lineage on a single 4090.
Datacenter tier (without cluster): DeepSeek R1 Distill Llama 70B on dual-A100.

Runtime compatibility

vLLM ✓ excellent. MTP head supported as of vLLM 0.7+; tensor-parallel-size 8 is the H100 reference deployment.
SGLang ✓ excellent. RadixAttention prefix-cache + agent-loop is the highest-throughput configuration.
TensorRT-LLM ✓ best-in-class for throughput at scale. Recompile-per-config friction is real.
Ollama / llama.cpp ✗ impractical at this size. Single-machine GGUF was not designed for this tier.
MLX-LM ✓ partial via Exo cluster. Research-grade only.

Quantization suitability

AWQ-INT4 is the operational sweet spot. INT8 fits on 16× H100 but the quality lift over INT4 is sub-1% on most benchmarks — rarely justifies the 2× hardware cost.

The MTP head needs special quant handling. Some pipelines silently drop it during conversion, killing the throughput advantage. Verify your runtime preserves it before committing.

Best use cases

Frontier-tier coding agents — pair with vLLM tensor-parallel + filesystem/git MCP. The MTP head + tool-calling discipline + MIT license combination is unique.
Math + scientific reasoning at scale — leader on AIME / GPQA among open-weight.
Production agent serving for organizations — MIT license unblocks deployments that Qwen license blocks.

When to use a different model

Single-card coding (RTX 4090 / 5090 / 6000 Ada): Qwen 2.5 Coder 32B is the operator default. V4 Pro is overkill.
Workstation reasoning: DeepSeek R1 Distill Qwen 32B — same lineage, single-card.
Multilingual focus: Qwen 3.5 235B-A17B — stronger non-English coverage.
Higher decode throughput, modest quality drop: DeepSeek V4 Flash is the throughput-tuned sibling.

Failure modes specific to this model

MTP head silently dropped during quantization. Some AWQ pipelines lose the MTP head; you get a working model that's missing 1.8× of its decode advantage. Test throughput post-quant.
Tool-call format strictness. V4 Pro is more strict about JSON-shape than V3 — agent harnesses that rely on lenient parsing may regress.
Cluster cost. 8× H100 hours is real money. Most operators should default to Qwen 2.5 Coder 32B or hosted-API access.

Going deeper

/stacks/local-coding-agent — agent-loop deployment recipe
/maps/inference-runtimes-2026 — runtime ecosystem map
vLLM operational review — production-recommended runtime
DeepSeek R1 Distill Qwen 32B — workstation-tier sibling

Reviewed May 6, 2026 by Fredoline Eruo

Family & lineage

How this model relates to others in its lineage. Family members share architecture and training-data roots; parent / children edges record direct distillation or fine-tune relationships.

Parent / base model

DeepSeek V3 (671B MoE)671B

Frontier

Family siblings (deepseek-v)

DeepSeek V3 Lite (16B MoE)16B

Consumer

DeepSeek V2.5 236B236B

Datacenter

DeepSeek V4 Flash (284B MoE)284B

Datacenter

DeepSeek V3 (671B MoE)671B

Frontier