HOW-TO · INF

How to configure DeepSeek models for reduced memory usage

Intermediate · 15 min · By Fredoline Eruo
PREREQUISITES

DeepSeek model pulled, knowledge of VRAM limits

What this does

Large DeepSeek-family models can exceed consumer VRAM quickly. This guide covers quantization, layer offloading, and context limiting to fit them into constrained memory budgets.
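As a rough illustration of why quantization matters, the weight footprint can be approximated as parameter count times bits per weight divided by eight. The sketch below uses a hypothetical helper named weights_mb and a 14B-parameter example. It ignores the KV cache, activations, and runtime overhead, and real quantized files usually come out somewhat larger because some tensors are kept at higher precision.

```shell
#!/usr/bin/env bash
# weights_mb is a hypothetical helper for a back-of-the-envelope estimate.
# Rough weight footprint in MB: parameters (in millions) x bits per weight / 8.
weights_mb() {
    local params_millions=$1 bits_per_weight=$2
    echo $(( params_millions * bits_per_weight / 8 ))
}

weights_mb 14000 16   # 14B parameters at 16-bit: prints 28000
weights_mb 14000 4    # 14B parameters at 4-bit:  prints 7000
```

At 16-bit the example model alone would exceed a 16 GB card, while a 4-bit quantization leaves headroom for the KV cache.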

Steps

  1. Select the most memory-efficient quantization. For 16 GB VRAM, start with a smaller distill or a quantized model that your runtime reports as fitting.

    ollama pull deepseek-r1:14b
    
  2. Limit context window to reduce KV cache size. KV cache memory grows with context length, model size, precision, and runtime settings.

    ollama run deepseek-r1:14b
    /set parameter num_ctx 2048
    
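  3. Offload layers to CPU when VRAM is tight. In Ollama, the num_gpu parameter sets how many layers are placed on the GPU; the remaining layers run on the CPU. (ollama run does not accept an --n-gpu-layers flag.)

    ollama run deepseek-r1:14b
    /set parameter num_gpu 24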
    
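  4. Use vLLM with memory budget flags for finer control. --gpu-memory-utilization caps the fraction of each GPU's memory that vLLM may claim, --max-model-len bounds the context length (and with it the KV cache), and --enforce-eager disables CUDA graph capture, trading some speed for memory. Substitute a model ID sized for your hardware; the full DeepSeek-V3 checkpoint shown here needs a multi-GPU server.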

    python -m vllm.entrypoints.openai.api_server \
        --model deepseek-ai/DeepSeek-V3-0324 \
        --gpu-memory-utilization 0.80 \
        --max-model-len 8192 \
        --enforce-eager
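To make the context-length trade-off from step 2 concrete, here is a minimal sketch of the usual KV cache estimate: 2 (keys and values) times layers, KV heads, head dimension, context length, and bytes per element. The helper name kv_cache_mib and the architecture numbers are assumptions for illustration; read the real values from your model's configuration, and expect the runtime to add its own overhead.

```shell
#!/usr/bin/env bash
# kv_cache_mib is a hypothetical helper, not part of Ollama or vLLM.
# Estimate: 2 (K and V) x layers x kv_heads x head_dim x context x bytes per element.
kv_cache_mib() {
    local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 bytes_per_elem=$5
    echo $(( 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1024 / 1024 ))
}

# Placeholder architecture: 48 layers, 8 KV heads of dimension 128, 16-bit cache.
kv_cache_mib 48 8 128 2048 2   # 2048-token context: prints 384
kv_cache_mib 48 8 128 8192 2   # 8192-token context: prints 1536
```

Under these assumptions the cache scales linearly with context length, so quartering the context quarters the cache.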
    

Verification

nvidia-smi --query-gpu=memory.used --format=csv,noheader
# Expected: memory usage stays within your VRAM budget (e.g., < 16 GB)
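
The check above can be turned into a pass/fail comparison against a budget. This is a sketch only: within_budget is a hypothetical helper, the 16384 MiB default is an assumed budget, and the nvidia-smi call is skipped when the tool is not installed.

```shell
#!/usr/bin/env bash
# within_budget is a hypothetical helper: succeeds when used MiB <= budget MiB.
within_budget() {
    local used_mib=$1 budget_mib=$2
    [ "$used_mib" -le "$budget_mib" ]
}

BUDGET_MIB=${BUDGET_MIB:-16384}   # assumed 16 GiB budget; override via environment

if command -v nvidia-smi >/dev/null 2>&1; then
    # nounits prints bare MiB values, one line per GPU
    nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits |
    while read -r used; do
        if within_budget "$used" "$BUDGET_MIB"; then
            echo "OK: ${used} MiB used (budget ${BUDGET_MIB} MiB)"
        else
            echo "OVER: ${used} MiB used (budget ${BUDGET_MIB} MiB)"
        fi
    done
fi
```

Run it while the model is loaded and serving a prompt, since memory use is lowest when the model is idle or unloaded.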

Common failures

  • VRAM still exceeded: Reduce num_ctx further (512 is a practical floor) or lower num_gpu so that more layers run on the CPU.
  • CPU inference too slow: Set Ollama's num_thread parameter to your physical CPU core count, for example /set parameter num_thread 8 inside the session.
  • Model fails to load with a 1.58-bit quantization: Ensure Ollama 0.5+ is installed; older versions lack dynamic quantization support.
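
Once a working combination is found, the settings can be persisted in an Ollama Modelfile so they do not have to be re-entered each session. The sketch below assumes the Modelfile PARAMETER names num_ctx, num_gpu, and num_thread; write_modelfile, the output path, and the values are hypothetical and should be tuned for your hardware.

```shell
#!/usr/bin/env bash
# write_modelfile is a hypothetical helper, not part of Ollama.
# Arguments: output path, base model, num_ctx, num_gpu (layers on GPU), num_thread.
write_modelfile() {
    local path=$1 base=$2 ctx=$3 gpu_layers=$4 threads=$5
    cat > "$path" <<EOF
FROM $base
PARAMETER num_ctx $ctx
PARAMETER num_gpu $gpu_layers
PARAMETER num_thread $threads
EOF
}

# Example values only; adjust after checking memory use.
write_modelfile ./Modelfile.lowmem deepseek-r1:14b 2048 24 8

# Then build a derived model that carries these defaults:
#   ollama create deepseek-r1-lowmem -f ./Modelfile.lowmem
```

The derived model name deepseek-r1-lowmem is arbitrary; running it afterwards applies the saved parameters without any /set commands.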

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
