How to configure DeepSeek models for reduced memory usage
Prerequisites: a DeepSeek model pulled locally, and knowledge of your GPU's VRAM limit.
What this does
Large DeepSeek-family models can exceed consumer VRAM quickly. This guide covers quantization, layer offloading, and context limiting to fit them into constrained memory budgets.
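As a rough illustration of why quantization and context limits matter, the sketch below estimates weight memory and KV cache size. It assumes a conventional multi-head or grouped-query attention layout. Models with other attention designs will differ, and real runtimes add overhead on top. The function names and the layer and head counts are hypothetical placeholders, not values read from any DeepSeek checkpoint.

```python
def weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GiB; the factor of 2 covers keys and values."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 2**30


if __name__ == "__main__":
    # Hypothetical 14B-parameter model; architecture numbers are placeholders.
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_gib(14e9, bits):.1f} GiB")
    for ctx in (2048, 8192):
        print(f"ctx={ctx}: {kv_cache_gib(48, 8, 128, ctx):.2f} GiB KV cache")
```

Treat the output as a lower bound: activations, runtime buffers, and the display server also consume VRAM.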
Steps
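1. Select the most memory-efficient quantization. For 16 GB of VRAM, start with a smaller distill or a quantized model that your runtime reports as fitting.
    ollama pull deepseek-r1:14b
2. Limit the context window to reduce KV cache size. KV cache memory grows with context length, model size, precision, and runtime settings. In Ollama, set num_ctx from inside the interactive session:
    ollama run deepseek-r1:14b
    >>> /set parameter num_ctx 2048
3. Offload layers to CPU when VRAM is tight. In Ollama, set num_gpu, the number of layers placed on the GPU; the remaining layers run on the CPU:
    >>> /set parameter num_gpu 24
4. Use vLLM with memory budget flags for finer control. --gpu-memory-utilization caps the fraction of GPU memory vLLM may claim, --max-model-len bounds the context length, and --enforce-eager skips CUDA graph capture, which saves some memory at a speed cost. The full DeepSeek-V3 checkpoint is far larger than a single consumer GPU can hold, so substitute a model that fits your hardware.
    python -m vllm.entrypoints.openai.api_server \
      --model deepseek-ai/DeepSeek-V3-0324 \
      --gpu-memory-utilization 0.80 \
      --max-model-len 8192 \
      --enforce-eager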
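For scripted use, Ollama's local HTTP API accepts runtime settings in an options object, where num_ctx sets the context length and num_gpu sets the number of layers placed on the GPU. The sketch below only builds the request; build_request is a hypothetical helper name, and the URL assumes Ollama's default local port.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, num_ctx: int, num_gpu: int) -> urllib.request.Request:
    """Build a POST request that carries memory-related options."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_gpu": num_gpu},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_request("deepseek-r1:14b", "Say hello.", num_ctx=2048, num_gpu=24)
    print(req.data.decode("utf-8"))
    # Sending the request needs a running Ollama server:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.loads(resp.read())["response"])
```

Keeping the options in the request makes the memory settings explicit for each call instead of relying on interactive session state.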
Verification
nvidia-smi --query-gpu=memory.used --format=csv,noheader
# Expected: memory usage stays within your VRAM budget (e.g., < 16 GB)
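The check above can also be automated. The sketch below parses the text that the nvidia-smi query prints (one line per GPU, such as "10240 MiB") and compares each value with a budget. The helper names and the sample string are illustrative assumptions, not measured output.

```python
def parse_used_mib(csv_text: str) -> list[int]:
    """Extract the used-memory value in MiB from each non-empty line."""
    values = []
    for line in csv_text.splitlines():
        line = line.strip()
        if line:
            # The first token is the number; a trailing "MiB" unit is ignored.
            values.append(int(line.split()[0]))
    return values


def within_budget(used_mib: list[int], budget_gib: float) -> bool:
    """Return True if every GPU is at or below the per-GPU budget."""
    return all(v <= budget_gib * 1024 for v in used_mib)


if __name__ == "__main__":
    sample = "10240 MiB\n"  # illustrative sample, not a real measurement
    used = parse_used_mib(sample)
    print(used, within_budget(used, budget_gib=16))
```

To feed it live data, capture the command's standard output (for example with subprocess.run and capture_output=True) and pass the text to parse_used_mib.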
Common failures
- VRAM still exceeded: Reduce num_ctx further (512 minimum) or offload more layers to CPU.
- CPU inference too slow: Set Ollama's num_thread parameter to match your CPU core count.
- Model fails to load with 1.58-bit: Ensure Ollama 0.5+ is installed; older versions lack dynamic quantization support.
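When VRAM is still exceeded, one way to reduce num_ctx methodically is to walk down a fixed list of candidates. The sketch below halves the value until it reaches the 512 floor mentioned above. Halving is an assumed strategy chosen for illustration, and num_ctx_schedule is a hypothetical helper.

```python
def num_ctx_schedule(start: int, floor: int = 512) -> list[int]:
    """List context sizes to try, halving from start down to floor."""
    if start < floor:
        raise ValueError("start must be at least the floor value")
    values = []
    ctx = start
    while ctx > floor:
        values.append(ctx)
        ctx //= 2
    values.append(floor)  # always end on the floor itself
    return values


if __name__ == "__main__":
    for ctx in num_ctx_schedule(2048):
        print(f"try num_ctx={ctx}")
```

Re-run the verification command after each candidate and stop at the first value that stays within budget.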
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
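A minimal sketch of such a record, assuming a JSON note is an acceptable format. The field names and the checkpoint_record helper are hypothetical, and the angle-bracket values are placeholders to fill in from your own setup.

```python
import json
import platform
from datetime import datetime, timezone


def checkpoint_record(runtime: str, model: str, hardware: str,
                      verification_output: str) -> dict:
    """Collect the details worth writing down after a successful run."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "host_os": platform.platform(),
        "runtime": runtime,
        "model": model,
        "hardware": hardware,
        "verification_output": verification_output,
    }


if __name__ == "__main__":
    record = checkpoint_record(
        runtime="<runtime name and version>",
        model="deepseek-r1:14b",
        hardware="<GPU model and VRAM>",
        verification_output="<output of the nvidia-smi query>",
    )
    print(json.dumps(record, indent=2))
```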