What this does

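DeepSeek MoE models use a Mixture-of-Experts (MoE) architecture that activates only a subset of parameters per token. This lowers compute per token, but every expert still has to be loaded, so memory is set by the total parameter count. This guide shows how to pull and run a quantized model with Ollama while keeping memory use within reach of consumer hardware.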
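
To make the compute-versus-memory distinction concrete, the sketch below estimates weight memory from a parameter count and an average bits-per-weight figure. The helper name estimate_weights_gb and the example numbers are illustrative assumptions, not measurements; real usage adds KV cache and runtime overhead on top.

```shell
#!/bin/sh
# Rough weight-memory estimate: parameters (billions) x bits per weight / 8.
# Hypothetical helper for illustration; it ignores KV cache and runtime overhead.
estimate_weights_gb() {
  params_b="$1"   # total parameters, in billions
  bits="$2"       # average bits per weight after quantization
  LC_ALL=C awk -v p="$params_b" -v b="$bits" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# An MoE model must hold all experts in memory, so size it by total
# parameters, not by the parameters active per token.
estimate_weights_gb 16 4.5   # hypothetical 16B-total model at ~4.5 bits/weight
estimate_weights_gb 16 16    # the same model at 16-bit precision, for comparison
```

The formula is deliberately crude: it is useful for deciding whether a download is worth attempting, not for predicting the exact figure a runtime will report.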

Steps

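Pull a quantized DeepSeek model. Ollama's default tags are usually 4-bit quantizations, which need far less memory than 16-bit weights. Note that deepseek-r1:14b is a dense distilled model rather than an MoE; to run an MoE model, substitute an MoE tag such as deepseek-v2:16b throughout, if the Ollama library lists it for your version.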
```
ollama pull deepseek-r1:14b
```
Expected: Download progress, then model registered in local store.
Verify that the model pulled correctly.
```
ollama list
```
Expected: deepseek-r1:14b appears with a file size of roughly 9 GB for the default quantization. The exact size depends on the tag.
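The check above can be scripted. This is a minimal sketch that assumes the model name is in the first column of the ollama list output, below a header row; has_model is a hypothetical helper name.
```shell
#!/bin/sh
# Succeeds if the given tag appears in the NAME column of `ollama list`
# output read from stdin. has_model is a hypothetical helper name.
has_model() {
  awk -v tag="$1" 'NR > 1 && $1 == tag { found = 1 } END { exit !found }'
}

# Usage (only runs when the ollama CLI is installed):
if command -v ollama >/dev/null 2>&1; then
  if ollama list | has_model "deepseek-r1:14b"; then
    echo "model present"
  else
    echo "model missing"
  fi
fi
```
Reading from stdin keeps the helper independent of the CLI, so it can be tried against saved output as well.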
Run with minimal context to reduce memory pressure.
```
ollama run deepseek-r1:14b
```
Inside the session, set a shorter context:
```
/set parameter num_ctx 4096
```
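A /set parameter change lasts only for the current session. One way to persist the shorter context is to derive a model from a Modelfile, sketched below. The helper write_modelfile and the derived name deepseek-r1-4k are arbitrary examples, not names from this guide.
```shell
#!/bin/sh
# Write a minimal Modelfile that derives a model with a fixed context length.
# write_modelfile is a hypothetical helper; FROM and PARAMETER are Modelfile
# instructions.
write_modelfile() {
  base="$1"; ctx="$2"; out="$3"
  printf 'FROM %s\nPARAMETER num_ctx %s\n' "$base" "$ctx" > "$out"
}

write_modelfile "deepseek-r1:14b" 4096 ./Modelfile

# Build the derived model only when the ollama CLI is available.
# The name deepseek-r1-4k is an arbitrary example.
if command -v ollama >/dev/null 2>&1; then
  ollama create deepseek-r1-4k -f ./Modelfile || echo "ollama create failed" >&2
fi
```
Running the derived model then starts with the 4096-token context without any in-session command.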
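Measure memory usage while the model is loaded. Run this from a second terminal, because the first one is occupied by the interactive session.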
```
ollama ps
```
Expected: Shows memory consumed by the running model. MoE models can reduce active compute, but total load memory still depends on the packaged weights and runtime.

Verification

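ollama ps
# Expected output: deepseek-r1:14b listed as loaded, with a size somewhat above the model's file size (weights plus KV cache and runtime buffers). The exact figure varies with context length, Ollama version, and hardware. The ~37B active-parameter figure belongs to the 671B DeepSeek-V3 and R1 models, not to this tag.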

Common failures

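Out of memory during load: Reduce num_ctx to 2048, or pull a smaller tag or a lower-bit quantization if the library lists one for this model.
Model not found: Confirm the exact tag on the model's page in the Ollama library. ollama list shows only models that are already pulled.
Slow inference: Check the PROCESSOR column of ollama ps. If part of the model runs on CPU, free up GPU memory or reduce num_ctx. Ollama does not take the llama.cpp flag --n-gpu-layers; the corresponding setting is the num_gpu parameter, for example /set parameter num_gpu <layers>.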
Disk space exhausted: Delete unused models with ollama rm <model>.
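
Disk exhaustion can be caught before a pull. The sketch below checks free space on the filesystem that holds the model directory. It assumes the usual per-user location ~/.ollama/models unless OLLAMA_MODELS is set; has_free_space and the 10 GB threshold are illustrative choices.

```shell
#!/bin/sh
# Succeeds if the filesystem holding DIR has at least NEED_GB gigabytes free.
# has_free_space is a hypothetical helper; df -Pk prints POSIX-format output
# in 1024-byte blocks, with available space in column 4.
has_free_space() {
  dir="$1"; need_gb="$2"
  avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  need_kb=$((need_gb * 1024 * 1024))
  [ "$avail_kb" -ge "$need_kb" ]
}

# Assumed model directory; it differs when Ollama runs as a system service.
models_dir="${OLLAMA_MODELS:-$HOME/.ollama/models}"
[ -d "$models_dir" ] || models_dir="$HOME"

if has_free_space "$models_dir" 10; then
  echo "enough free space for a ~10 GB download"
else
  echo "low disk space: consider 'ollama rm <model>'"
fi
```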

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
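
The checkpoint can be captured with a short script rather than by hand. This is one possible sketch: record_checkpoint and the log file name are hypothetical, and the hardware line is limited to what uname reports, so GPU details still need to be added manually.

```shell
#!/bin/sh
# Append a dated record of runtime, system, and verification output to a log.
# record_checkpoint and the log file name are hypothetical choices.
record_checkpoint() {
  log="$1"
  {
    echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "system: $(uname -srm)"
    if command -v ollama >/dev/null 2>&1; then
      echo "runtime: $(ollama --version 2>&1)"
      echo "verification (ollama ps):"
      ollama ps 2>&1
    else
      echo "runtime: ollama not found on PATH"
    fi
    echo "---"
  } >> "$log"
}

record_checkpoint ./will-it-run.log
```

Appending rather than overwriting keeps earlier runs available for comparison when the runtime or model version changes.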
