HOW-TO · INF

How to allocate specific GPU memory limits per model

Advanced · 15 min · By Fredoline Eruo
PREREQUISITES

Two or more models to serve, and llama.cpp, Ollama, or vLLM installed

What this does

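When several models share one GPU, each needs a fixed share of VRAM, or growth in one can push the others into out-of-memory errors. This guide covers per-instance memory budgeting with llama.cpp layer counts, Ollama environment variables, vLLM's memory-utilization fraction, and NVIDIA MIG partitioning.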

Steps

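  1. Map VRAM per model. On a 24 GB GPU with model A (needs 8 GB) and model B (needs 10 GB), 6 GB remains for KV cache and system overhead. With llama.cpp, --n-gpu-layers sets how many layers are offloaded to the GPU, not a byte limit, so choose layer counts whose weights fit each budget:

    # Model A: offload enough layers to use ~8 GB
    ./llama-server -m model-a.gguf --n-gpu-layers 32 --port 8080
    # Model B: offload enough layers to use ~10 GB
    ./llama-server -m model-b.gguf --n-gpu-layers 40 --port 8081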
    
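  2. Use Ollama's OLLAMA_MAX_VRAM (a value in bytes), if your version supports it. Ollama reads its environment variables in the server process, so set the limit on ollama serve rather than on ollama run, and give each server its own OLLAMA_HOST:

    # One server per model, each with its own VRAM cap and port
    OLLAMA_MAX_VRAM=8000000000 OLLAMA_HOST=127.0.0.1:11434 ollama serve &
    OLLAMA_MAX_VRAM=10000000000 OLLAMA_HOST=127.0.0.1:11435 ollama serve &
    OLLAMA_HOST=127.0.0.1:11434 ollama run model-a
    # In another terminal:
    OLLAMA_HOST=127.0.0.1:11435 ollama run model-b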
    
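  3. For vLLM, set --gpu-memory-utilization per instance. The value is the fraction of the GPU's total memory that one instance may use for weights, activations, and KV cache, so the fractions across all instances must sum to less than 1.0:

    python -m vllm.entrypoints.openai.api_server \
        --model meta-llama/Llama-3.2-3B \
        --gpu-memory-utilization 0.35 \
        --port 8000 &
    # Let the first server finish loading before starting the second,
    # so each instance sizes its memory against a settled GPU.
    python -m vllm.entrypoints.openai.api_server \
        --model mistralai/Mistral-7B-v0.1 \
        --gpu-memory-utilization 0.50 \
        --port 8001 &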
    
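  4. Use NVIDIA MIG (Multi-Instance GPU) for hardware-level partitioning. MIG mode must be enabled first, and the available profile names depend on the GPU; the profiles below are the kind offered by 80 GB cards, so list what your GPU supports before creating instances.

    sudo nvidia-smi -i 0 -mig 1       # enable MIG mode (may require a GPU reset)
    sudo nvidia-smi mig -i 0 -lgip    # list the instance profiles this GPU offers
    sudo nvidia-smi mig -i 0 -cgi 1g.10gb,2g.20gb -C
    # Creates two GPU instances (10 GB and 20 GB), each with a default compute instance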
    
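Once the MIG instances exist, each server still has to be pointed at its own slice. The following is a minimal sketch, assuming the driver lists MIG devices in `nvidia-smi -L` with `MIG-...` UUIDs and that the CUDA runtime accepts such a UUID in `CUDA_VISIBLE_DEVICES`. The helper name `mig_uuids` is hypothetical, not part of any NVIDIA tool.

```shell
# Hypothetical helper: read `nvidia-smi -L` style text on stdin and print
# one MIG device UUID per line. Lines for whole GPUs (UUID: GPU-...) are skipped.
mig_uuids() {
    sed -n 's/.*(UUID: \(MIG-[^)]*\)).*/\1/p'
}

# Usage on a MIG-enabled host (order follows the nvidia-smi -L listing):
#   uuid_a=$(nvidia-smi -L | mig_uuids | sed -n 1p)
#   uuid_b=$(nvidia-smi -L | mig_uuids | sed -n 2p)
#   CUDA_VISIBLE_DEVICES="$uuid_a" ./llama-server -m model-a.gguf --n-gpu-layers 32 --port 8080 &
#   CUDA_VISIBLE_DEVICES="$uuid_b" ./llama-server -m model-b.gguf --n-gpu-layers 40 --port 8081 &
```

Pinning by UUID rather than by index keeps the mapping stable if instances are later destroyed and recreated in a different order.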

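The budgets in steps 1 and 3 are easy to get wrong by hand, so it may help to check the arithmetic before launching anything. The sketch below assumes whole-gigabyte budgets; the helper names `budget_fits` and `util_fraction` are hypothetical. Note that a vLLM fraction covers the instance's own KV cache, so the shared reserve matters less there than it does for llama.cpp.

```shell
# Hypothetical helper: succeed if the reserve plus all per-model budgets
# fit in total VRAM. Arguments: total_gb reserve_gb model_gb...
budget_fits() {
    total=$1; reserve=$2; shift 2
    used=$reserve
    for gb in "$@"; do
        used=$((used + gb))
    done
    [ "$used" -le "$total" ]
}

# Hypothetical helper: convert a per-model budget into a fraction suitable
# for vLLM's --gpu-memory-utilization, rounded down to two decimals.
util_fraction() {
    awk -v m="$1" -v t="$2" 'BEGIN { printf "%.2f\n", int(m * 100 / t) / 100 }'
}

# Example from step 1: 24 GB card, 6 GB reserve, models of 8 GB and 10 GB.
if budget_fits 24 6 8 10; then
    echo "A: $(util_fraction 8 24)  B: $(util_fraction 10 24)"
else
    echo "over budget" >&2
fi
```

For the step 1 numbers this prints fractions of 0.33 and 0.41. Rounding down rather than to nearest keeps the sum of fractions from creeping past the planned total.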
Verification

nvidia-smi
# Expected: Two distinct processes listed, each consuming its allocated VRAM portion
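Reading the table by eye works for two processes; for a repeatable check, the per-process query output can be filtered against a ceiling. This is a sketch that assumes `nvidia-smi --query-compute-apps` reports `used_memory` in MiB on your driver; the helper name `over_budget` is hypothetical.

```shell
# Hypothetical helper: read "pid, name, used_mib" CSV rows on stdin and
# print "pid used_mib" for every process above the limit given as $1 (MiB).
over_budget() {
    awk -F', *' -v limit="$1" '$3 + 0 > limit { print $1, $3 }'
}

# Usage on a GPU host, with a 10 GiB (10240 MiB) ceiling:
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory \
#       --format=csv,noheader,nounits | over_budget 10240
```

An empty result means every listed process is within the ceiling.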

Common failures

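  • OLLAMA_MAX_VRAM not honored: Not every Ollama release reads this variable, and it only takes effect when set on the server process (ollama serve), not on ollama run. Check the release notes for your version, or use llama.cpp directly.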
  • MIG not supported: MIG is limited to select data-center GPUs such as the A30, A100, H100, and H200; consumer cards do not support it. Use layer-based allocation instead.
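  • Cumulative overshoot: If one model grows past its budget (for example, as its KV cache fills at long context), every process on the GPU may hit out-of-memory errors. Set each limit about 15% below the model's nominal budget.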
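The 15% headroom can be applied mechanically to whatever unit a runtime expects. A minimal sketch, with a hypothetical helper named `with_headroom` that uses integer arithmetic and rounds down:

```shell
# Hypothetical helper: shrink a budget (any integer unit) by a headroom
# percentage. Arguments: budget percent
with_headroom() {
    echo $(( $1 * (100 - $2) / 100 ))
}

with_headroom 8000000000 15    # 8 GB budget in bytes  -> 6800000000
with_headroom 10240 15         # 10 GiB budget in MiB  -> 8704
```

The first result is in the byte form used for OLLAMA_MAX_VRAM above; the second matches the MiB values that nvidia-smi reports.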

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
