What this does

Large DeepSeek-family models can exceed consumer VRAM quickly. This guide covers quantization, layer offloading, and context limiting to fit them into constrained memory budgets.

Steps

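Select the most memory-efficient quantization that still meets your quality needs. For 16 GB of VRAM, start with a smaller distill or a quantized model that your runtime reports as fitting. The default deepseek-r1:14b tag in the Ollama library is a 4-bit quantized distill of roughly 9 GB.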
```
ollama pull deepseek-r1:14b
```
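
As a rough planning aid, the weight footprint can be estimated from parameter count and bits per weight. The sketch below is a back-of-the-envelope calculation, not a measurement: the function name and the 14B parameter count are illustrative assumptions, and real quantized files mix precisions and carry metadata, so the size your runtime reports will differ somewhat.

```python
def weight_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in bytes."""
    return n_params * bits_per_weight / 8


GIB = 1024 ** 3

# Illustrative inputs only: a 14B-parameter model at several nominal precisions.
for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    size_gib = weight_bytes(14e9, bits) / GIB
    print(f"{label:>5}: {size_gib:.1f} GiB (weights only, no KV cache)")
```

Treat the result as a lower bound: the KV cache and runtime buffers come on top of it.
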
Limit the context window to reduce KV cache size. KV cache memory grows with context length, model size, precision, and runtime settings. The /set command is typed at the interactive prompt that ollama run opens.
```
ollama run deepseek-r1:14b
/set parameter num_ctx 2048
```
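
To see why num_ctx matters, the KV cache size can be approximated as two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below uses assumed architecture numbers (48 layers, 8 KV heads, head dimension 128, 16-bit cache entries) purely for illustration; read the real values from your model's configuration, and expect runtimes that quantize the cache or allocate it differently to deviate from this estimate.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Approximate KV cache size for one sequence, in bytes."""
    # The leading 2 accounts for one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


MIB = 1024 ** 2

# Assumed, illustrative architecture values (not read from any model file).
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128

for n_ctx in (2048, 8192, 32768):
    size_mib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, n_ctx) / MIB
    print(f"num_ctx={n_ctx:>6}: about {size_mib:.0f} MiB of KV cache")
```

The estimate scales linearly with num_ctx, which is why cutting the context window is usually the cheapest way to recover memory.
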

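Offload layers to CPU when VRAM is tight. In Ollama, the num_gpu parameter sets how many layers are placed on the GPU; the remaining layers run on the CPU. (The equivalent llama.cpp flag is --n-gpu-layers.)

```
ollama run deepseek-r1:14b
/set parameter num_gpu 24
```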
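
The same settings can be passed per request through Ollama's local REST API, where num_gpu (layers kept on the GPU) and num_ctx go in the options object. The following is a minimal sketch using only the standard library; it assumes the server is listening on the default address, and the helper names build_request and generate are hypothetical, chosen for this example.

```python
import json
import urllib.request

# Assumption: Ollama is running locally on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model, prompt, num_ctx=2048, num_gpu=24):
    """Build a POST request that caps context length and GPU-resident layers."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return one JSON object instead of a stream
        "options": {"num_ctx": num_ctx, "num_gpu": num_gpu},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, **options):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, prompt, **options)) as resp:
        return json.loads(resp.read())["response"]
```

Calling generate("deepseek-r1:14b", "Hello", num_gpu=16) would keep 16 layers on the GPU; lower the value step by step until the model loads without exhausting VRAM.
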

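Use vLLM with memory budget flags for finer control. --gpu-memory-utilization caps the fraction of GPU memory vLLM may claim, --max-model-len caps the context length (and with it the KV cache), and --enforce-eager disables CUDA graph capture, trading some speed for lower memory use. The full DeepSeek-V3-0324 checkpoint has 671B parameters and needs a multi-GPU server, so on a single consumer GPU serve a distilled checkpoint that fits instead:

```
python -m vllm.entrypoints.openai.api_server \
    --model deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --gpu-memory-utilization 0.80 \
    --max-model-len 8192 \
    --enforce-eager
```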
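
A quick way to sanity-check these flags is to work out how much memory is left for the KV cache once the weights are loaded. The sketch below is simple arithmetic under stated assumptions: the function name is hypothetical, the 1 GiB overhead for activations and runtime buffers is a guess, and the 3.5 GiB weight size is an illustrative figure rather than a measured one.

```python
def kv_budget_gib(total_vram_gib, gpu_memory_utilization, weights_gib, overhead_gib=1.0):
    """Memory left for KV cache after weights and an assumed runtime overhead."""
    budget = total_vram_gib * gpu_memory_utilization
    return budget - weights_gib - overhead_gib


# Illustrative inputs: a 16 GiB card, 80% utilization, 3.5 GiB of weights.
remaining = kv_budget_gib(16, 0.80, 3.5)
if remaining > 0:
    print(f"About {remaining:.1f} GiB left for KV cache")
else:
    print("Weights alone exceed the budget; pick a smaller or quantized model")
```

A negative result means the model cannot load at that utilization setting, no matter how small --max-model-len is.
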

Verification

```
nvidia-smi --query-gpu=memory.used --format=csv,noheader
# Expected: memory usage stays within your VRAM budget (e.g., < 16 GB)
```
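
The check can be scripted so it is repeatable. This sketch assumes the output format produced by the csv,noheader option (one line per GPU, such as "9123 MiB"); the function names and the 16384 MiB budget are illustrative assumptions.

```python
import subprocess


def parse_used_mib(output: str) -> list[int]:
    """Parse lines such as '9123 MiB' (one per GPU) into integers."""
    return [int(line.split()[0]) for line in output.splitlines() if line.strip()]


def read_used_mib() -> list[int]:
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_used_mib(out)


def within_budget(used_mib: list[int], budget_mib: int) -> bool:
    """True when every GPU is at or below the budget."""
    return all(used <= budget_mib for used in used_mib)


# Demonstration on a sample string rather than live hardware.
sample = "9123 MiB\n"
print(within_budget(parse_used_mib(sample), budget_mib=16384))
```

On a machine with an NVIDIA GPU, within_budget(read_used_mib(), 16384) performs the same check against live readings; run it while the model is loaded and serving a request.
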

Common failures

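VRAM still exceeded: Reduce num_ctx further (values below about 512 are rarely practical) or lower num_gpu so that more layers run on the CPU.
CPU inference too slow: Set the num_thread parameter (for example, /set parameter num_thread 8) to match your physical CPU core count.
Model fails to load with a 1.58-bit dynamic quantization: Check ollama --version and upgrade to Ollama 0.5 or later; older versions lack dynamic quantization support.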

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.

How to configure DeepSeek models for reduced memory usage
