deepseek
284B parameters
Commercial OK
Reviewed June 2026

DeepSeek V4 Flash (284B MoE)

The cost-efficient sibling of V4-Pro. 284B total / 13B active MoE, same hybrid CSA+HCA attention, same 1M context. The MoE active-param ratio (4.5%) makes it surprisingly fast for its nameplate size — practical on dual A100 / single H200 / Mac Studio M3 Ultra 192 GB.

License: MIT · Released Apr 24, 2026 · Context: 1,048,576 tokens

Our verdict

OP · Fredoline Eruo|VERIFIED JUN 12, 2026
unrated

Positioning

DeepSeek V4 Flash is the mid-tier MoE of the V4 family (released April 2026). It takes the V4 Pro reasoning DNA and packages it for the homelab and consumer-tier hardware that V4 Pro can't comfortably target: 284B total parameters with 13B active per token. Where V4 Pro asks for 192-GB unified memory or rented datacenter GPUs, V4 Flash runs comfortably on the 192-GB tier at Q4 (~170 GB) and is doable on a 128-GB Mac Studio M3 Ultra with partial offload. The operator-grade pitch: the same DeepSeek-team training discipline and reasoning-first design as V4 Pro at one-fifth the active parameter count. For most readers of this site, V4 Flash is the right DeepSeek V4 to actually consider running.

Strengths

  • Strong reasoning + coding combo at meaningfully lower hardware cost than V4 Pro. Benchmarks land 5-10 percentage points behind V4 Pro on most evals — which still beats most pre-V4-era frontier models.
  • MoE efficiency. ~13B active parameters per token mean inference math is closer to a 13B dense model than a 284B dense one. Tok/s is reasonable for the total parameter count, especially on Apple Silicon, where the Mac Studio M3 Ultra at 192 GB delivers ~18-28 tok/s on Q4.
  • Permissive license: MIT, with open weights and commercial use allowed (verify the license text for your specific use case).
  • Genuinely deployable on consumer + homelab hardware. ~170-180 GB at Q4 fits 192-GB tier; ~140 GB at Q3 fits 128-GB tier with offload. Multi-GPU homelab (4× RTX 3090 = 96 GB) runs Q3 with system-RAM offload. Real options.
  • Strong day-zero tooling support. vLLM, SGLang, llama.cpp all shipped V4 Flash compatibility within hours of release. Less tooling lag than the V4 Pro path.
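As a rough check on the deployment options listed above, the amount that spills out of fast memory is simply the quantized size minus what the hardware holds. A minimal sketch, assuming the approximate sizes quoted in this verdict and ignoring KV cache and runtime overhead; the function name is illustrative and not part of any tool:

```python
def offload_gb(model_gb: float, fast_memory_gb: float) -> float:
    """GB of weights that must live outside fast memory (0 if everything fits)."""
    return max(0.0, model_gb - fast_memory_gb)

# Figures quoted in this verdict; KV cache and OS overhead are NOT included.
setups = {
    "Q4 on 192-GB tier": (180.0, 192.0),  # upper end of the ~170-180 GB range
    "Q3 on 128-GB tier": (140.0, 128.0),
    "Q3 on 4x RTX 3090 (96 GB)": (140.0, 4 * 24.0),
}

for name, (model_gb, memory_gb) in setups.items():
    print(f"{name}: {offload_gb(model_gb, memory_gb):.0f} GB offloaded")
```

On these numbers the 4-card rig pushes roughly 44 GB to system RAM, against about 12 GB on the 128-GB tier, which is why the two "Q3 with offload" options should not be expected to perform alike.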

Limitations

  • Quality gap vs V4 Pro is real, not just on-paper. For the hardest reasoning tasks (long multi-step proofs, edge-case code refactoring), V4 Pro is meaningfully better. V4 Flash is "frontier-adjacent," not "frontier."
  • Memory is still substantial. ~170 GB at Q4 is far above any single consumer GPU. 24 GB cards are not in scope at any usable quant. The minimum operator-grade hardware is 96 GB unified memory or 4-card homelab.
  • Tok/s drops fast at lower quants. Q3 at ~13-18 tok/s on Mac Studio M3 Ultra is functional but not interactive. Q2 quality regressions are visible vs Q3.
  • Knowledge cutoff is early-2026 — for current-events / recent-API workloads, augment with RAG. Same constraint as V4 Pro.

Real-world performance on Mac Studio M3 Ultra (192 GB)

  • Q4 (~170 GB): ~18-28 tok/s decode, TTFT ~1-2s on 1K prompts. Genuinely interactive.
  • Q3 (~140 GB): ~22-32 tok/s, faster TTFT, slight quality dip. The right balance for daily use.
  • Q5 (~200 GB partial-offload to swap): ~10-15 tok/s. Quality bump over Q4 is small; rarely worth the speed loss on local hardware.
  • For comparison: a rented 4× H100 80GB datacenter setup runs FP8 V4 Flash at ~100-150 tok/s.
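To translate the decode rates above into wall-clock time, a simple model is time to first token plus output tokens divided by decode speed. The sketch below applies it to the Q4 figures; the 500-token reply length is an arbitrary illustration, and the TTFT range is the one quoted for 1K prompts:

```python
def response_time_s(output_tokens: int, decode_tok_s: float, ttft_s: float) -> float:
    """Estimated seconds for a full reply: time to first token plus decode time."""
    return ttft_s + output_tokens / decode_tok_s

REPLY_TOKENS = 500  # hypothetical reply length, chosen only for illustration

# Q4 figures from the list above: ~18-28 tok/s decode, ~1-2 s TTFT on 1K prompts.
best = response_time_s(REPLY_TOKENS, decode_tok_s=28, ttft_s=1.0)
worst = response_time_s(REPLY_TOKENS, decode_tok_s=18, ttft_s=2.0)
print(f"Q4, {REPLY_TOKENS}-token reply: {best:.0f}-{worst:.0f} s")
```

Longer prompts will raise TTFT beyond the quoted range, so treat this as a lower-bound estimate for long-context use.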

Should you run this locally?

Yes, if you have a 96-GB+ Mac Studio (or equivalent unified memory hardware) and want frontier-tier reasoning + coding output without the V4 Pro hardware cost. The Mac Studio M3 Ultra 192 GB tier at Q4 is genuinely interactive (~18-28 tok/s) — that's a usable daily driver, not just a batch-processing tool.

Yes, if you're running a 4-card homelab (RTX 3090 ×4 at $2,800-3,200 used) and willing to accept Q3 with partial offload. The economics work out — a Mac Studio M3 Ultra 192 GB is $5,000-7,000; the 4-card 3090 rig is half that.

No, for anyone running a single consumer GPU. 24 GB doesn't fit any usable quant. Use Qwen 3 30B-A3B or DeepSeek R1 Distill instead — designed for 24 GB tier hardware.

Probably not, for "I have a 192-GB Mac Studio and want absolute frontier" — pick V4 Pro at Q3 instead. Quality is meaningfully better; tok/s drops to 5-9 but it's the same hardware.

How it compares

  • vs DeepSeek V4 Pro (1.6T MoE) → V4 Pro has higher quality ceiling at much higher hardware cost. V4 Flash fits 128-GB hardware where V4 Pro doesn't. For 95% of operators reading these verdicts, V4 Flash is the right DeepSeek V4 to actually deploy.
  • vs Qwen 3.5 235B-A17B → Qwen 3.5 235B has an Apache 2.0 license and better multilingual support; V4 Flash has stronger chain-of-thought reasoning and slightly faster decode (13B active vs 17B active). Pick on language requirements, license preference, and reasoning-vs-generalist focus.
  • vs Qwen 3 235B-A22B → similar hardware tier, similar quality tier. Qwen 3 235B-A22B has an Apache 2.0 license; V4 Flash has stronger reasoning. Both are reasonable picks; the choice often comes down to which one the operator already has loaded.
  • vs DeepSeek R1 (671B reasoning specialist) → R1 is reasoning-only; V4 Flash is generalist with reasoning. For reasoning-only workloads R1 wins; for mixed daily-driver tasks V4 Flash is more useful and runs on smaller hardware.
  • vs Llama 4 Scout (Meta MoE) → Llama 4 Scout has 128k effective context vs the 64k context V4 Flash was run at for this verdict (its nominal window is 1,048,576 tokens). The Llama license has a 700M MAU clause; V4 Flash's MIT license is more permissive. Pick on context length and license.

Run this yourself

# Mac Studio M3 Ultra 192GB — Q4 fits (partial offload tolerable)
ollama pull deepseek-v4-flash:q4_K_M
ollama run deepseek-v4-flash:q4_K_M

# Or via llama.cpp directly:
llama-server -m deepseek-v4-flash-Q4_K_M.gguf \
  --ctx-size 65536 -ngl 999 --no-mmap
Quant: Q4_K_M GGUF · Context: 65536 (KV cache f16, ~30 GB additional) · Backend: llama.cpp Metal via Ollama · Hardware: Mac Studio M3 Ultra 192 GB unified memory
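The configuration above leaves little slack, which a quick budget makes visible. This is a sketch only: it combines the 162.0 GB Q4_K_M file size from the quantization table on this page with the ~30 GB KV cache figure quoted here, and the `reserved_gb` parameter is a hypothetical placeholder for whatever the OS and other processes hold.

```python
def headroom_gb(total_memory_gb: float, weights_gb: float,
                kv_cache_gb: float, reserved_gb: float = 0.0) -> float:
    """Memory left after weights, KV cache and any reserved amount (negative = spill)."""
    return total_memory_gb - (weights_gb + kv_cache_gb + reserved_gb)

# 192 GB unified memory, Q4_K_M weights, f16 KV cache at 65536 context.
print(headroom_gb(192.0, weights_gb=162.0, kv_cache_gb=30.0))

# Same setup with a hypothetical 8 GB held back for the OS.
print(headroom_gb(192.0, weights_gb=162.0, kv_cache_gb=30.0, reserved_gb=8.0))
```

A zero or negative result is consistent with the "partial offload tolerable" note in the command comment; reducing `--ctx-size` is the obvious lever if the spill hurts throughput.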

Overview

Strengths

  • 13B active params — fast despite 284B nameplate
  • 1M context window with same hybrid attention as V4-Pro
  • MIT license, $0.14/$0.28 per 1M tokens via DeepSeek API
  • Single Mac Studio M3 Ultra 192GB runs it via MLX
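For readers weighing the hosted API against local hardware, the per-token pricing above converts to a monthly figure with one line of arithmetic. The sketch assumes the two prices are input and output rates respectively (the page does not label them), and the example workload is invented for illustration:

```python
def api_cost_usd(input_tokens: int, output_tokens: int,
                 in_price_per_m: float = 0.14,    # assumed: input price per 1M tokens
                 out_price_per_m: float = 0.28    # assumed: output price per 1M tokens
                 ) -> float:
    """Estimated API spend in USD for a given token volume."""
    return (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000

# Hypothetical month: 200M input tokens and 50M output tokens.
print(f"${api_cost_usd(200_000_000, 50_000_000):.2f}")
```

Comparing that number with the hardware prices quoted in the verdict gives a rough break-even point, though it ignores electricity, privacy requirements and rate limits.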

Weaknesses

  • 162 GB Q4_K_M — workstation hardware required
  • Quality below V4-Pro on hardest reasoning tasks
  • Quality degrades faster below Q4 than it does for dense models

Quantization variants

Each quantization trades model quality for file size and VRAM. Q4_K_M is the most popular starting point.

Quantization   File size   VRAM required
Q4_K_M         162.0 GB    192 GB
Q5_K_M         198.0 GB    224 GB
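The file sizes in the table imply an average storage cost per parameter, which is a handy sanity check when estimating quants that are not listed. A minimal sketch, assuming 1 GB means 1e9 bytes and using the 284B total parameter count quoted on this page:

```python
TOTAL_PARAMS = 284e9  # total parameter count quoted on this page

def bits_per_weight(file_size_gb: float, params: float = TOTAL_PARAMS) -> float:
    """Average bits stored per parameter, treating 1 GB as 1e9 bytes."""
    return file_size_gb * 1e9 * 8 / params

for name, size_gb in [("Q4_K_M", 162.0), ("Q5_K_M", 198.0)]:
    print(f"{name}: {bits_per_weight(size_gb):.2f} bits/weight")
```

Both results sit above the nominal 4 and 5 bits because K-quant files mix precisions and carry scale metadata, so sizing a model by "parameters × nominal bits" will underestimate the download.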

Get the model

HuggingFace

Original weights

huggingface.co/deepseek-ai/DeepSeek-V4-Flash

Source repository — direct quantization required.

Frequently asked

What's the minimum VRAM to run DeepSeek V4 Flash (284B MoE)?

192GB of VRAM is enough to run DeepSeek V4 Flash (284B MoE) at the Q4_K_M quantization (file size 162.0 GB). Higher-quality quantizations need more.

Can I use DeepSeek V4 Flash (284B MoE) commercially?

Yes. DeepSeek V4 Flash (284B MoE) ships under the MIT license, which permits commercial use. Always read the license text before deployment.

What's the context length of DeepSeek V4 Flash (284B MoE)?

DeepSeek V4 Flash (284B MoE) supports a context window of 1,048,576 tokens (2^20, about 1.05 million).

Source: huggingface.co/deepseek-ai/DeepSeek-V4-Flash

Reviewed by RunLocalAI Editorial. See our editorial policy for how we research and verify model claims.
