by DeepSeek AI
DeepSeek's frontier reasoning + code MoE family. DeepSeek V3 (671B MoE) and V4 are the leading open-weight reasoning models in 2026; DeepSeek Coder V3 is the canonical open-weight code model. All ship under DeepSeek's permissive, commercial-friendly license.
Start with DeepSeek R1-Distill-Qwen-32B at Q4_K_M via Ollama — it runs on a single RTX 4090 (24 GB), delivers 86.1% on MMLU and ~89% on MATH, and captures roughly 75% of full DeepSeek V3 reasoning quality at 1/20th the VRAM. For lighter hardware (<16 GB VRAM), use R1-Distill-Llama-8B at Q5_K_M — ~6 GB, and it runs on a MacBook Pro (M4 Max) at 25+ tok/s. Skip full-scale DeepSeek V3/V4 MoE (671B–1T params) for local deployment — Q4 requires ~380 GB VRAM minimum and decode drops to ~3–4 tok/s even on a Mac Studio M3 Ultra. The distilled variants are the pragmatic entry point for 90% of users. Skip DeepSeek Coder V3 unless you specifically need FIM (fill-in-the-middle) code completion — the base V3 and distilled variants handle code generation competitively.
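A minimal sketch of the recommended starting point, using Ollama's Python client against a local `ollama serve` instance. The model tag and its default quantization are assumptions — check `ollama list` and the Ollama model library for the exact tag that maps to the 32B Q4_K_M distill.

```python
# Sketch: pull and query the R1-Distill-Qwen-32B model through Ollama's Python client.
# Assumes `ollama serve` is running locally; the tag below is an assumption — verify it
# against the Ollama library before relying on it.
import ollama

MODEL = "deepseek-r1:32b"  # library tag assumed to resolve to the Q4_K_M distill

# Download the weights if they are not already cached locally (roughly a 20 GB pull).
ollama.pull(MODEL)

response = ollama.chat(
    model=MODEL,
    messages=[
        {"role": "user", "content": "Prove that the sum of two odd integers is even."}
    ],
    # Modest context and temperature; adjust for your VRAM budget and task.
    options={"num_ctx": 8192, "temperature": 0.6},
)
print(response["message"]["content"])
```

If the 32B tag is too heavy for your machine, the same call works with the 8B distill tag; only the model name changes.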
For single-user local: Ollama + deepseek-r1:32b Q4_K_M on an RTX 4090 24 GB, or an Apple M3 Ultra via MLX-LM. Distilled variants are standard dense architectures — use the same deployment stack as the base model (Llama or Qwen). For multi-user MoE serving: vLLM 0.6.3+ with the FP8 DeepSeek V3 MLA kernel on 4× H100 SXM — ~6,000 tok/s at batch 32 with expert parallelism. Enable multi-token prediction (MTP) for single-user throughput (+1.8×). For datacenter MoE: TensorRT-LLM 0.12.0+ FP8 on 8× H100 SXM — ~18,000 tok/s at batch 128. Never quantize MoE router weights below FP16. ExLlamaV2 does not support DeepSeek MoE — use vLLM or SGLang. See the GPU buyer guide.
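A sketch of the multi-GPU vLLM path, shown as offline batch inference on a 4× H100 node. Argument names and FP8/MoE support differ across vLLM releases, and the Hugging Face repo name is an assumption — check your installed version's docs before copying.

```python
# Sketch: offline batch inference with vLLM, sharded across 4 GPUs.
# Flag names and FP8 support vary by vLLM release; treat these arguments as
# assumptions to verify against your version, not a definitive config.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",  # HF repo name assumed; point at your local checkpoint if needed
    tensor_parallel_size=4,           # one shard per H100
    quantization="fp8",               # FP8 weights/activations if your release supports it for this model
    trust_remote_code=True,
    max_model_len=16384,
)

params = SamplingParams(temperature=0.6, max_tokens=1024)
outputs = llm.generate(
    ["Explain the difference between expert parallelism and tensor parallelism."],
    params,
)
print(outputs[0].outputs[0].text)
```

For multi-user serving, the equivalent `vllm serve` command exposes an OpenAI-compatible endpoint with the same parallelism settings; batching across concurrent requests is what gets you toward the throughput figures above.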
Models in this family with our verdicts
Verify that DeepSeek runs on your specific hardware before committing money.