DeepSeek V4 Flash (284B MoE)

The cost-efficient sibling of V4-Pro. 284B total / 13B active MoE, same hybrid CSA+HCA attention, same 1M context. The MoE active-param ratio (4.5%) makes it surprisingly fast for its nameplate size — practical on dual A100 / single H200 / Mac Studio M3 Ultra 192 GB.

License: MIT·Released Apr 24, 2026·Context: 1,048,576 tokens

Positioning

DeepSeek V4 Flash is the late-2025 / early-2026 mid-tier MoE that takes the V4 Pro reasoning DNA and packages it for the homelab and consumer-tier hardware that V4 Pro can't comfortably target. 284B total parameters with 16-22B active per token (variant-dependent). Where V4 Pro asks for 192-GB unified memory or rented datacenter GPUs, V4 Flash fits a 128-GB Mac Studio M3 Ultra at Q4 (170 GB partial-offload, doable) or runs comfortably on the 192-GB tier. The operator-grade pitch: same DeepSeek-team training discipline + reasoning-first design as V4 Pro at one-fifth the active parameter count. For most readers of this site, V4 Flash is the right DeepSeek V4 to actually consider running.

Strengths

Strong reasoning + coding combo at meaningfully lower hardware cost than V4 Pro. Benchmarks land 5-10 percentage points behind V4 Pro on most evals — which still beats most pre-V4-era frontier models.
MoE efficiency. ~16-22B active parameters per token mean inference math is closer to a 22B dense model than a 284B dense one. Tok/s is reasonable for the absolute parameter count, especially on Apple Silicon where the Mac Studio M3 Ultra at 192 GB delivers ~18-28 tok/s on Q4.
Permissive license — open weights, commercial use allowed (verify the DeepSeek license for your specific use case).
Genuinely deployable on consumer + homelab hardware. ~170-180 GB at Q4 fits 192-GB tier; ~140 GB at Q3 fits 128-GB tier with offload. Multi-GPU homelab (4× RTX 3090 = 96 GB) runs Q3 with system-RAM offload. Real options.
Strong day-zero tooling support. vLLM, SGLang, llama.cpp all shipped V4 Flash compatibility within hours of release. Less tooling lag than the V4 Pro path.

Limitations

Quality gap vs V4 Pro is real, not just on-paper. For the hardest reasoning tasks (long multi-step proofs, edge-case code refactoring), V4 Pro is meaningfully better. V4 Flash is "frontier-adjacent," not "frontier."
Memory is still substantial. ~170 GB at Q4 is far above any single consumer GPU. 24 GB cards are not in scope at any usable quant. The minimum operator-grade hardware is 96 GB unified memory or 4-card homelab.
Tok/s drops fast at lower quants. Q3 at ~13-18 tok/s on Mac Studio M3 Ultra is functional but not interactive. Q2 quality regressions are visible vs Q3.
Knowledge cutoff is early-2026 — for current-events / recent-API workloads, augment with RAG. Same constraint as V4 Pro.

Real-world performance on Mac Studio M3 Ultra (192 GB)

Q4 (~170 GB): ~18-28 tok/s decode, TTFT ~1-2s on 1K prompts. Genuinely interactive.
Q3 (~140 GB): ~22-32 tok/s, faster TTFT, slight quality dip. The right balance for daily use.
Q5 (~200 GB partial-offload to swap): ~10-15 tok/s. Quality bump over Q4 is small; rarely worth the speed loss on local hardware.
Compare with: rented H100 80GB ×4 datacenter setup runs FP8 V4 Flash at ~100-150 tok/s.

Should you run this locally?

Yes, if you have a 96-GB+ Mac Studio (or equivalent unified memory hardware) and want frontier-tier reasoning + coding output without the V4 Pro hardware cost. The Mac Studio M3 Ultra 192 GB tier at Q4 is genuinely interactive (~18-28 tok/s) — that's a usable daily driver, not just a batch-processing tool.

Yes, if you're running a 4-card homelab (RTX 3090 ×4 at $2,800-3,200 used) and willing to accept Q3 with partial offload. The economics work out — a Mac Studio M3 Ultra 192 GB is $5,000-7,000; the 4-card 3090 rig is half that.

No, for anyone running a single consumer GPU. 24 GB doesn't fit any usable quant. Use Qwen 3 30B-A3B or DeepSeek R1 Distill instead — designed for 24 GB tier hardware.

Probably not, for "I have a 192-GB Mac Studio and want absolute frontier" — pick V4 Pro at Q3 instead. Quality is meaningfully better; tok/s drops to 5-9 but it's the same hardware.

How it compares

vs DeepSeek V4 Pro (1.6T MoE) → V4 Pro has higher quality ceiling at much higher hardware cost. V4 Flash fits 128-GB hardware where V4 Pro doesn't. For 95% of operators reading these verdicts, V4 Flash is the right DeepSeek V4 to actually deploy.
vs Qwen 3.5 235B-A17B → Qwen 3.5 235B has Apache 2.0 license + better multilingual; V4 Flash has stronger reasoning chain-of-thought + slightly faster decode (16-22B active vs 17B active is essentially comparable). Pick on language requirements + license preference + reasoning-vs-generalist focus.
vs Qwen 3 235B-A22B → similar hardware tier, similar quality tier. Q3 235B-A22B has Apache 2.0 license; V4 Flash has stronger reasoning. Both are reasonable picks; the choice often comes down to which the operator already has loaded.
vs DeepSeek R1 (671B reasoning specialist) → R1 is reasoning-only; V4 Flash is generalist with reasoning. For reasoning-only workloads R1 wins; for mixed daily-driver tasks V4 Flash is more useful and runs on smaller hardware.
vs Llama 4 Scout (Meta MoE) → Llama 4 Scout has 128k effective context vs V4 Flash's 64k. Llama license has 700M MAU clause; DeepSeek license is more permissive. Pick on context-length + license.

Run this yourself

# Mac Studio M3 Ultra 192GB — Q4 fits (partial offload tolerable)
ollama pull deepseek-v4-flash:q4_K_M
ollama run deepseek-v4-flash:q4_K_M

# Or via llama.cpp directly:
llama-server -m deepseek-v4-flash-Q4_K_M.gguf \
  --ctx-size 65536 -ngl 999 --no-mmap

Quant: Q4_K_M GGUF Context: 65536 (KV cache f16, ~30 GB additional) Backend: llama.cpp Metal via Ollama Hardware: Mac Studio M3 Ultra 192 GB unified memory

Quantization	File size	VRAM required
Q4_K_M	162.0 GB	192 GB
Q5_K_M	198.0 GB	224 GB

Quantization

File size

VRAM required

Q4_K_M

162.0 GB

192 GB

Q5_K_M

198.0 GB

224 GB

Frequently asked

What's the minimum VRAM to run DeepSeek V4 Flash (284B MoE)?

192GB of VRAM is enough to run DeepSeek V4 Flash (284B MoE) at the Q4_K_M quantization (file size 162.0 GB). Higher-quality quantizations need more.

Can I use DeepSeek V4 Flash (284B MoE) commercially?

Yes — DeepSeek V4 Flash (284B MoE) ships under the MIT, which permits commercial use. Always read the license text before deployment.

What's the context length of DeepSeek V4 Flash (284B MoE)?

DeepSeek V4 Flash (284B MoE) supports a context window of 1,048,576 tokens (about 1049K).

DeepSeek V4 Flash (284B MoE)

Our verdict

Positioning

Strengths

Limitations

Real-world performance on Mac Studio M3 Ultra (192 GB)

Should you run this locally?

How it compares

Run this yourself

Overview

Family & lineage

Strengths

Weaknesses

Quantization variants

Get the model

HuggingFace

Hardware that runs this

Models worth comparing

Frequently asked

What's the minimum VRAM to run DeepSeek V4 Flash (284B MoE)?

Can I use DeepSeek V4 Flash (284B MoE) commercially?

What's the context length of DeepSeek V4 Flash (284B MoE)?

Related — keep moving