How to benchmark DeepSeek against dense models of similar size
Prerequisites: a DeepSeek MoE model and a comparable dense model, both already downloaded through Ollama.
What this does
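Compares throughput, memory usage, and response quality between a DeepSeek MoE model and a dense model with a comparable active-parameter count. The full DeepSeek-R1 and DeepSeek-V3 MoE models activate about 37B of their 671B parameters per token, so speed should be compared against a dense model of similar active size, while memory follows the total parameter count. Llama-3-70B is used here as the dense baseline, although it is roughly twice that active size. Note that the deepseek-r1:14b tag in Ollama is a dense distilled model rather than an MoE; substitute an MoE tag (for example deepseek-r1:671b, or deepseek-v2:16b on smaller hardware) when you need a true MoE comparison.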
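The difference between active and total parameters can be made concrete with a little arithmetic. The sketch below is an estimate only: it counts model weights and ignores the KV cache and runtime overhead, and the 4.5 bits per weight figure is an assumed rough average for a 4-bit quantization, not a specification. The function name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


# Assumed values for illustration only.
BITS_Q4 = 4.5        # rough average bits per weight for a 4-bit quant (assumption)
MOE_TOTAL_B = 671.0  # full DeepSeek-R1/V3: total parameters, in billions
MOE_ACTIVE_B = 37.0  # parameters activated per token, in billions
DENSE_B = 70.0       # Llama-3-70B

print(f"MoE weights that must be resident: {weight_memory_gb(MOE_TOTAL_B, BITS_Q4):.0f} GB")
print(f"MoE weights read per token:        {weight_memory_gb(MOE_ACTIVE_B, BITS_Q4):.0f} GB")
print(f"Dense weights (resident and read): {weight_memory_gb(DENSE_B, BITS_Q4):.0f} GB")
```

Under these assumptions the MoE needs far more memory than the dense model even though it reads fewer weights per token, which is why memory and latency have to be compared separately.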
Steps
Run the DeepSeek model in one terminal.
ollama run deepseek-r1:14b
Run the dense model in a second terminal.
ollama run llama3:70b
Execute a standardized benchmark script.
import statistics
import time

import requests

OLLAMA_URL = "http://localhost:11434/api/generate"

def benchmark(model, prompt, runs=5):
    """Return the mean and standard deviation of wall-clock latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        response = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=600,
        )
        response.raise_for_status()  # fail loudly instead of timing an error reply
        latencies.append(time.perf_counter() - start)
    return statistics.mean(latencies), statistics.stdev(latencies)

prompts = ["Write a Python quicksort", "Explain quantum entanglement", "Summarize the history of Rome"]
# Replace these tags with the models you actually pulled.
for model in ["deepseek-r1:14b", "llama3:70b"]:
    for p in prompts:
        mean, std = benchmark(model, p)
        print(f"{model} | {p[:30]}... | {mean:.2f}s ± {std:.2f}s")
Measure peak memory for each.
ollama ps
nvidia-smi --query-gpu=memory.used --format=csv
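Wall-clock latency mixes model speed with response length, and a reasoning model may simply generate more tokens. A possible refinement, sketched here with the standard library only, is to compute tokens per second from the eval_count and eval_duration fields that Ollama returns in a non-streaming /api/generate response (eval_duration is reported in nanoseconds). The helper names tokens_per_second and generate are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def tokens_per_second(result: dict) -> float:
    """Decode throughput from a non-streaming /api/generate response body."""
    duration_ns = result.get("eval_duration", 0)  # generation time in nanoseconds
    if duration_ns <= 0:
        return 0.0  # field missing or zero: no meaningful rate
    return result.get("eval_count", 0) / (duration_ns / 1e9)


def generate(model: str, prompt: str, timeout_s: float = 600.0) -> dict:
    """Send one non-streaming request and return the parsed JSON response."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

With a server running, tokens_per_second(generate("llama3:70b", "Explain quantum entanglement")) gives a rate that can be compared across models regardless of how long each answer is.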
Verification
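# Expected: each model is listed as loaded, with its size and CPU/GPU split. Record both memory figures.
# Memory follows total parameter count and quantization, so treat fixed numbers (such as ~30 GB vs ~45 GB) as rough guides, not pass/fail thresholds.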
ollama ps
nvidia-smi
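A single nvidia-smi reading is a snapshot, not a peak. One way to approximate peak usage, sketched below, is to poll the same query while the benchmark runs and keep the maximum. The parser assumes the usual CSV layout of a header row followed by one line per GPU such as "30512 MiB"; the function names are hypothetical, and the polling function needs nvidia-smi on the PATH.

```python
import subprocess
import time


def parse_memory_used_mib(csv_text: str) -> list[int]:
    """Parse `nvidia-smi --query-gpu=memory.used --format=csv` output, one value per GPU."""
    values = []
    for line in csv_text.strip().splitlines()[1:]:  # skip the header row
        line = line.strip()
        if line:
            values.append(int(line.split()[0]))  # "30512 MiB" -> 30512
    return values


def sample_gpu_memory_mib() -> list[int]:
    """Take one reading per GPU by running nvidia-smi."""
    completed = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv"],
        capture_output=True, text=True, check=True,
    )
    return parse_memory_used_mib(completed.stdout)


def peak_gpu_memory_mib(duration_s: float = 30.0, interval_s: float = 0.5) -> int:
    """Poll for duration_s seconds and return the highest total across all GPUs."""
    peak = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        peak = max(peak, sum(sample_gpu_memory_mib()))
        time.sleep(interval_s)
    return peak
```

Run peak_gpu_memory_mib() in a separate process while the benchmark script is generating, once per model, and record both results.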
Common failures
- Unfair comparison: Ensure both use the same quantization level (q4_k_m) and context length.
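- Cold-start and cache interference: Run a warm-up prompt for each model before timing so that model load time is excluded, and use identical prompts for both models so any prompt caching affects them equally.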
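- Memory swap on MoE: An MoE must keep all experts in memory, so its footprint follows total rather than active parameters. If VRAM is insufficient and layers are offloaded to CPU, an MoE may degrade more gracefully than a dense model of similar total size because only the active experts are read per token, but check the CPU/GPU split in ollama ps rather than assuming it.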
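The first two failure modes above can be reduced by sending both models the same request options and discarding warm-up runs. The following is a minimal sketch under stated assumptions: num_ctx, num_predict, temperature, and seed are standard Ollama request options, but the default values chosen here are arbitrary, and build_payload and timed_runs are hypothetical helper names. The send argument is any callable that posts the payload to the server. Quantization is set by the model tag you pull, not by these options.

```python
import time


def build_payload(model: str, prompt: str, num_ctx: int = 4096,
                  num_predict: int = 256, seed: int = 0) -> dict:
    """Build an /api/generate request body with identical settings for every model."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": num_ctx,          # same context length for both models
            "num_predict": num_predict,  # cap output so answer length does not skew latency
            "temperature": 0,
            "seed": seed,
        },
    }


def timed_runs(send, payload: dict, warmup: int = 1, runs: int = 5) -> list[float]:
    """Call send(payload) warmup times untimed, then return latencies of the timed runs."""
    for _ in range(warmup):
        send(payload)  # loads the model into memory; timing discarded
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        send(payload)
        latencies.append(time.perf_counter() - start)
    return latencies
```

Passing send as a parameter keeps the timing logic separate from the HTTP client, so the same helper works with requests, urllib, or a stub during testing.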
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
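If you prefer to keep the checkpoint in a file rather than in notes, something like the sketch below could work. The field names and the checkpoint_record helper are hypothetical; the runtime, model, and hardware strings are supplied by you, and only the operating system and Python version are detected automatically.

```python
import datetime
import json
import platform


def checkpoint_record(runtime: str, model: str, hardware: str,
                      verification_output: str) -> dict:
    """Collect the operator checkpoint fields into a JSON-serializable dict."""
    return {
        "recorded_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "runtime": runtime,    # e.g. the output of `ollama --version`
        "model": model,        # the exact tag that was benchmarked
        "hardware": hardware,  # GPU model, VRAM, backend
        "os": platform.platform(),
        "python": platform.python_version(),
        "verification_output": verification_output,
    }


def save_checkpoint(record: dict, path: str) -> None:
    """Write the record to path as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2)
```

One record per model and quantization level makes later runs easy to compare.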