DeepSeek
DeepSeek is a family of open-weight large language models developed by DeepSeek (深度求索), a Chinese AI research company. The models range from small (e.g., DeepSeek-R1-Distill-Qwen-1.5B) to massive (DeepSeek-V3, with 671B total parameters and 37B activated per token). They are known for strong reasoning performance, especially the DeepSeek-R1 series, which uses reinforcement learning to improve chain-of-thought reasoning. Operators encounter DeepSeek models as downloadable weights on Hugging Face, runnable via llama.cpp, Ollama, vLLM, or MLX. The largest models require significant memory: the full V3 at FP16 needs ~1.3 TB for the weights alone, and even quantized versions (e.g., Q4_K_M) take ~400 GB, which still requires multi-GPU setups or high-RAM servers.
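The memory figures above follow from simple arithmetic: parameter count times bits per weight, divided by eight. The sketch below is a rough estimator for the weights only, not a sizing tool. The function name is hypothetical, and the ~4.8 bits per weight used for Q4_K_M is an assumed average, since the real figure varies by model and tensor mix.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    # (params_billions * 1e9 params) * (bits / 8 bytes) / (1e9 bytes per GB)
    return params_billions * bits_per_weight / 8


# Bits-per-weight values are approximations; ~4.8 for Q4_K_M is an assumption.
for label, bits in [("FP16", 16.0), ("Q4_K_M (assumed ~4.8 bpw)", 4.8)]:
    print(f"671B parameters at {label}: ~{weight_memory_gb(671, bits):.0f} GB")
```

The estimate ignores the KV cache, activations, and runtime overhead, so actual requirements are higher.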
Deeper dive
DeepSeek models are notable for the Mixture-of-Experts (MoE) architecture used in the V3 and R1 families. Each MoE layer in V3 has 256 routed experts plus one shared expert, and the router activates 8 routed experts per token, so only 37B of the 671B parameters are active per forward pass. This reduces compute cost while maintaining high capacity, although all 671B parameters must still be held in memory. The R1 series adds reinforcement learning to improve reasoning traces, often producing longer chain-of-thought outputs. Distilled versions (e.g., DeepSeek-R1-Distill-Qwen-7B) are smaller, dense models fine-tuned on R1 outputs, making them more accessible on consumer hardware. Operators should note that DeepSeek models are released under a permissive license (MIT for most), allowing commercial use; check the model card for each release, since distilled models also inherit the terms of their Qwen or Llama base models. The larger models require careful memory planning: a 4-bit quantized V3 (~400 GB) needs at least 5× 80 GB GPUs for the weights alone, plus headroom for the KV cache, or CPU offloading with significant RAM.
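To make the routing idea concrete, here is a toy sketch of top-k expert selection for a single token. It is a generic illustration under simplifying assumptions (scalar "experts", softmax over the selected scores), not DeepSeek's actual gating, which also involves a shared expert and load balancing. The names route_token and moe_forward are hypothetical.

```python
import math


def route_token(scores: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and normalise their gate weights."""
    # Indices of the k largest router scores (ties keep their original order).
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Softmax over the selected scores only, so the gate weights sum to 1.
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x: float, scores: list[float], experts, k: int) -> float:
    """Weighted sum of the outputs of the selected experts only."""
    gates = route_token(scores, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())


# Toy setup: 4 scalar "experts" that multiply by a constant, 2 active per token.
experts = [lambda x, m=m: m * x for m in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 2.0, 0.3, 2.0]
print(sorted(route_token(scores, k=2)))        # indices of the selected experts
print(moe_forward(1.0, scores, experts, k=2))  # only those experts are evaluated
```

Only the selected experts run, which is why compute scales with the active parameter count while memory scales with the total.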
Practical example
An operator with a single RTX 4090 (24 GB VRAM) can run DeepSeek-R1-Distill-Qwen-7B at Q4_K_M (~5 GB) with 4K context, achieving ~30-40 tok/s. The full DeepSeek-R1 (671B) at Q4_K_M requires ~400 GB for the weights alone, so it would need at least 5× 80 GB A100s or 10× 40 GB A100s, and more in practice once the KV cache and activations are counted. On a Mac Studio with 128 GB unified memory, MLX can run the 7B distilled model at ~20 tok/s, but the full model does not fit in memory.
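The GPU counts in this example come from dividing the required memory by per-GPU capacity and rounding up. A minimal sketch of that calculation follows; the function name and the headroom_gb parameter are hypothetical, and the 40 GB headroom in the last call is an illustrative assumption rather than a measured KV-cache size.

```python
import math


def min_gpus(weights_gb: float, gpu_gb: float, headroom_gb: float = 0.0) -> int:
    """Smallest GPU count whose combined memory covers weights plus headroom.

    headroom_gb is extra memory reserved for KV cache and activations;
    the default of 0 gives a weights-only lower bound.
    """
    return math.ceil((weights_gb + headroom_gb) / gpu_gb)


print(min_gpus(400, 80))                  # 80 GB GPUs, weights only
print(min_gpus(400, 40))                  # 40 GB GPUs, weights only
print(min_gpus(400, 80, headroom_gb=40))  # 80 GB GPUs with assumed headroom
```

Treat the weights-only result as a floor: frameworks also reserve memory per GPU, and tensor parallelism often works best with GPU counts that evenly divide the model's attention heads.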
Workflow example
To run DeepSeek-R1-Distill-Qwen-7B via Ollama: ollama pull deepseek-r1:7b downloads ~4.7 GB of quantized weights. Then ollama run deepseek-r1:7b loads the model into VRAM. If VRAM is insufficient, Ollama offloads some layers to the CPU and system RAM, dropping throughput from ~35 tok/s to ~5 tok/s. For the full V3, operators use vLLM with tensor parallelism across multiple GPUs, for example on an 8-GPU node: vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 8. The tensor-parallel size should match the number of GPUs used, and their combined memory must exceed the size of the weights.
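One way to check whether a model is running fully on the GPU is to measure decode speed through Ollama's local HTTP API. The sketch below assumes a server on the default port 11434 and relies on the eval_count and eval_duration (nanoseconds) fields of the /api/generate response. The helper names are hypothetical, and the prompt is only an illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def tokens_per_second(response: dict) -> float:
    """Decode speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def generate(prompt: str, model: str = "deepseek-r1:7b") -> dict:
    """Send one non-streaming generation request to a local Ollama server."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Usage (requires a running Ollama server with the model already pulled):
#   result = generate("Explain mixture-of-experts in one sentence.")
#   print(f"{tokens_per_second(result):.1f} tok/s")
```

A result far below the expected range for the hardware suggests that part of the model has been offloaded to system RAM.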
Reviewed by Fredoline Eruo. See our editorial policy.