RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Notable models & companies

DeepSeek

DeepSeek is a family of open-weight large language models developed by DeepSeek (深度求索), a Chinese AI research company. The models range from small (e.g., DeepSeek-R1-Distill-Qwen-1.5B) to massive (DeepSeek-V3, with 671B total parameters and 37B activated per token). They are known for strong reasoning performance, especially the DeepSeek-R1 series, which uses reinforcement learning to improve chain-of-thought reasoning. Operators encounter DeepSeek models as downloadable weights on Hugging Face, runnable via llama.cpp, Ollama, vLLM, or MLX. The large models need far more memory than a single GPU offers: the full V3 at FP16 needs ~1.3 TB for weights alone, and quantized versions (e.g., Q4_K_M) still take ~400 GB, which means multi-GPU setups or high-RAM servers.
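The memory figures above follow from simple arithmetic: parameter count times bits per weight. The sketch below reproduces them. The 4.8 bits-per-weight value for Q4_K_M is an assumed average chosen for illustration; real GGUF files mix quantization types per tensor, so actual file sizes differ somewhat.

```python
# Rough weight-only memory estimate: parameters x bits per weight.
# This ignores KV cache, activations, and runtime overhead.

TOTAL_PARAMS = 671e9  # DeepSeek-V3 total parameter count, from the text

BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q4_K_M": 4.8,  # ASSUMPTION: approximate average for this quant type
}


def weight_gb(params: float, bits_per_weight: float) -> float:
    """Return weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:7s} ~{weight_gb(TOTAL_PARAMS, bits):,.0f} GB")
```

The same function works for the distilled models: swap in their parameter count to see why a 7B model at 4-bit fits comfortably on a consumer GPU.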

Deeper dive

DeepSeek models are notable for the Mixture-of-Experts (MoE) architecture used in the V3 and R1 families. In V3, each MoE layer has 256 routed experts plus one shared expert, and the router activates 8 routed experts per token, so only 37B of the 671B parameters are active per forward pass. This reduces compute cost per token while maintaining high capacity, but all 671B parameters must still be held in memory. The R1 series adds reinforcement learning to improve reasoning traces, often producing longer chain-of-thought outputs. Distilled versions (e.g., DeepSeek-R1-Distill-Qwen-7B) are smaller, dense models fine-tuned on R1 outputs, making them more accessible on consumer hardware. Operators should note that DeepSeek models are released under permissive licenses (MIT for most) that allow commercial use; check each model card, since distilled models also carry the terms of their Qwen or Llama base. However, the larger models require careful VRAM planning: a 4-bit quantized V3 (~400 GB) fills 5× 80 GB GPUs with weights alone, so it needs more GPUs than that or CPU offloading with significant RAM.
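To make the routing idea concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is a toy under stated assumptions, not DeepSeek's implementation: the expert count, the scalar "experts", the random router scores, and the softmax gate are all hypothetical choices made for illustration.

```python
import math
import random


def route_top_k(scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their gates."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_forward(x, experts, router_scores, k):
    """Weighted sum of the selected experts' outputs; the rest never run."""
    gates = route_top_k(router_scores, k)
    return sum(gate * experts[i](x) for i, gate in gates.items())


# Toy setup (hypothetical): 8 experts, each a scalar multiply; 2 run per token.
experts = [lambda x, scale=scale: scale * x for scale in range(1, 9)]
random.seed(0)
scores = [random.gauss(0, 1) for _ in experts]

gates = route_top_k(scores, k=2)
print("active experts:", sorted(gates), "of", len(experts))
print("output:", moe_forward(1.0, experts, scores, k=2))
```

The point of the sketch is the cost model: compute scales with k, while memory scales with the total number of experts, which is why an MoE model is cheap per token but still expensive to host.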

Practical example

An operator with a single RTX 4090 (24 GB VRAM) can run DeepSeek-R1-Distill-Qwen-7B at Q4_K_M (~5 GB) with 4K context, achieving ~30-40 tok/s. The full DeepSeek-R1 (671B) at Q4_K_M requires ~400 GB for weights, so 5× 80 GB A100s or 10× 40 GB A100s hold the weights with no room left for KV cache; plan for more. On a Mac Studio with 128 GB unified memory, MLX can run the 7B distilled model at ~20 tok/s, but the full model is impractical because even its 4-bit weights exceed the available memory.
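The GPU counts above can be checked with a short calculation. The sketch below compares a weights-only count with one that reserves part of each GPU for KV cache and runtime overhead. The 15% reserve is an assumed placeholder, not a measured figure; the real requirement depends on context length, batch size, and the serving stack.

```python
import math


def gpus_needed(weights_gb: float, gpu_gb: float, reserve_fraction: float = 0.0) -> int:
    """Smallest GPU count whose usable memory covers the weights.

    reserve_fraction is the share of each GPU held back for KV cache,
    activations, and runtime overhead (an illustrative assumption).
    """
    usable_gb = gpu_gb * (1.0 - reserve_fraction)
    return math.ceil(weights_gb / usable_gb)


WEIGHTS_GB = 400  # ~Q4_K_M size of the 671B model, from the text

for gpu_gb in (80, 40, 24):
    bare = gpus_needed(WEIGHTS_GB, gpu_gb)
    padded = gpus_needed(WEIGHTS_GB, gpu_gb, reserve_fraction=0.15)
    print(f"{gpu_gb} GB GPUs: {bare} for weights alone, {padded} with 15% reserved")
```

A weights-only count that divides evenly, as 400 GB across 80 GB cards does, is a warning sign: it leaves nothing for the cache that grows with every token of context.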

Workflow example

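To run DeepSeek-R1-Distill-Qwen-7B via Ollama: ollama pull deepseek-r1:7b downloads ~4.7 GB of quantized weights. Then ollama run deepseek-r1:7b loads the model into VRAM and opens an interactive prompt. If VRAM is insufficient, Ollama offloads some layers to the CPU and system RAM, dropping throughput from ~35 tok/s to ~5 tok/s. For the full V3, operators use vLLM with tensor parallelism across multiple GPUs, setting --tensor-parallel-size to the GPU count: vllm serve deepseek-ai/DeepSeek-V3 --tensor-parallel-size 4. The Hugging Face checkpoint is stored in FP8 and is far larger than the ~400 GB Q4_K_M figure, so four GPUs suffice only if their combined memory covers it; 80 GB cards need a higher count.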
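Once the model is pulled, it can also be queried from a script through Ollama's local HTTP API instead of the interactive prompt. The sketch below uses only the Python standard library and assumes an Ollama server is running on its default port (11434) with deepseek-r1:7b already downloaded. The helper names build_request and generate are hypothetical, introduced here for illustration.

```python
import json
import urllib.request

# Ollama's default local endpoint for one-shot text generation.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "deepseek-r1:7b") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "deepseek-r1:7b") -> str:
    """Send the prompt and return the model's full reply text.

    Raises urllib.error.URLError if no Ollama server is listening.
    """
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, generate("Why is the sky blue? Answer in one sentence.") returns the completion as a string. Setting "stream" to False trades incremental output for a single JSON reply, which keeps the example short; R1-style models may include their reasoning trace in that reply.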

Reviewed by Fredoline Eruo. See our editorial policy.
