What this does

When several models share one GPU, each instance needs a bounded share of VRAM so that one cannot starve the others. This guide covers per-instance memory budgeting with llama.cpp layer counts, Ollama environment variables, vLLM's memory-utilization fraction, and NVIDIA MIG partitioning.

Steps

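Map VRAM per model. On a 24 GB GPU with models A (needs 8 GB) and B (needs 10 GB), the remaining 6 GB is reserved for KV cache and system overhead. llama.cpp has no hard VRAM cap: --n-gpu-layers only sets how many layers are offloaded to the GPU, so adjust the count until measured usage matches the budget:

# Model A: offload enough layers to use ~8 GB (tune the count after measuring)
./llama-server -m model-a.gguf --n-gpu-layers 32 --port 8080
# Model B: offload enough layers to use ~10 GB (tune the count after measuring)
./llama-server -m model-b.gguf --n-gpu-layers 40 --port 8081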
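
The budget arithmetic above can be sketched as a small shell helper. The function name vram_reserve_gb is hypothetical and the numbers are the ones from this example; it is a sanity check on the plan, not a measurement of real usage.

```shell
#!/usr/bin/env bash
# Hypothetical helper: GB left over after subtracting each model's budget.
# Usage: vram_reserve_gb TOTAL_GB MODEL_GB [MODEL_GB ...]
vram_reserve_gb() {
    local remaining=$1
    shift
    local need
    for need in "$@"; do
        remaining=$((remaining - need))
    done
    echo "$remaining"
}

# 24 GB card, model A at 8 GB, model B at 10 GB
vram_reserve_gb 24 8 10
```

A negative result means the planned budgets already exceed the card before any KV cache is allocated.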

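Use Ollama's OLLAMA_MAX_VRAM (if your version supports it). Ollama reads its environment variables in the server process, not in ollama run, so a per-model limit requires one server per model, each on its own port. The values are byte counts:

OLLAMA_HOST=127.0.0.1:11434 OLLAMA_MAX_VRAM=8000000000 ollama serve &
OLLAMA_HOST=127.0.0.1:11434 ollama run model-a
# In another terminal:
OLLAMA_HOST=127.0.0.1:11435 OLLAMA_MAX_VRAM=10000000000 ollama serve &
OLLAMA_HOST=127.0.0.1:11435 ollama run model-b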
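
The byte values passed to OLLAMA_MAX_VRAM above are decimal gigabytes written out in full. A minimal sketch of the conversion, assuming the variable expects bytes as this guide's values imply (gb_to_bytes is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert decimal gigabytes to bytes (1 GB = 10^9 bytes).
gb_to_bytes() {
    echo $(( $1 * 1000 * 1000 * 1000 ))
}

gb_to_bytes 8    # budget for model A
gb_to_bytes 10   # budget for model B
```

Generating the value this way avoids miscounting zeros when typing the limit by hand.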

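For vLLM, set --gpu-memory-utilization per instance. The value is the fraction of total GPU memory that one instance may use for weights, activations, and KV cache, so the fractions across instances must sum to less than 1. Each fraction must also be large enough for that model: a 7B model in 16-bit precision needs roughly 14 GB for weights alone.

# Start the instances one at a time so their startup memory profiling does not overlap
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B \
    --gpu-memory-utilization 0.35 \
    --port 8000 &
python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mistral-7B-v0.1 \
    --gpu-memory-utilization 0.50 \
    --port 8001 &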
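
To derive a utilization fraction from a GB budget, divide the budget by the card's total memory. The sketch below assumes the 24 GB card and the 8 GB and 10 GB budgets from the first step; vram_fraction is a hypothetical helper name, not part of vLLM.

```shell
#!/usr/bin/env bash
# Hypothetical helper: express a per-model budget as a fraction of total VRAM.
# Usage: vram_fraction BUDGET_GB TOTAL_GB
vram_fraction() {
    awk -v budget="$1" -v total="$2" 'BEGIN { printf "%.2f\n", budget / total }'
}

vram_fraction 8 24    # candidate value for the first instance
vram_fraction 10 24   # candidate value for the second instance
```

With these inputs the helper prints 0.33 and 0.42, which sum to 0.75 and leave a quarter of the card unclaimed.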

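Use NVIDIA MIG (Multi-Instance GPU) for hardware-level partitioning. MIG mode must be enabled on the GPU before instances can be created, and the available profiles depend on the GPU model (the 1g.10gb and 2g.20gb profiles below exist on 80 GB A100 and H100 cards).

# Enable MIG mode on GPU 0 (may require a GPU reset with no processes running)
sudo nvidia-smi -i 0 -mig 1
sudo nvidia-smi mig -i 0 -cgi 1g.10gb,2g.20gb -C
# Creates two GPU instances: 10 GB and 20 GB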
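
Once the instances exist, each server has to be pinned to one of them. A minimal sketch, assuming nvidia-smi -L prints each MIG device with a UUID of the form MIG-... inside parentheses (the exact format varies by driver version); mig_uuids is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract MIG device UUIDs from `nvidia-smi -L` output on stdin.
mig_uuids() {
    grep -o 'MIG-[^)]*'
}

# On a machine with MIG instances:
#   nvidia-smi -L | mig_uuids
# Then pin one server per instance (the UUID below is a placeholder):
#   CUDA_VISIBLE_DEVICES=MIG-<uuid> ./llama-server -m model-a.gguf --port 8080
```

Because each process sees only its own MIG instance, the memory limit is enforced by hardware rather than by the runtime's own accounting.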

Verification

nvidia-smi
# Expected: Two distinct processes listed, each consuming its allocated VRAM portion
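
For a check that can be scripted, nvidia-smi can report per-process memory as CSV. The sketch below flags any process above a limit given in MiB; over_budget is a hypothetical helper name, and it applies one limit to every process, so run it once per budget if the models differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "pid, used_MiB" lines on stdin and print every
# process whose usage exceeds the limit (in MiB) passed as the first argument.
over_budget() {
    awk -F', *' -v limit="$1" \
        '$2 + 0 > limit { print $1 " uses " $2 " MiB (limit " limit " MiB)" }'
}

# On a machine with an NVIDIA driver:
#   nvidia-smi --query-compute-apps=pid,used_memory \
#       --format=csv,noheader,nounits | over_budget 10240
```

No output means every listed process is at or below the limit.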

Common failures

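OLLAMA_MAX_VRAM not honored: Support for this variable depends on the Ollama version, and it has no effect when set on ollama run rather than on the server process. If your version ignores it, use llama.cpp directly.
MIG not supported: MIG is available only on select data-center GPUs such as the A30, A100, H100, and H200; consumer GeForce cards do not support it. Use layer-based allocation instead.
Cumulative overshoot: Outside MIG, budgets are not enforced across processes, so if one model exceeds its share, both may hit out-of-memory errors. Set conservative limits with 15% headroom.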
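
One way to read the 15% headroom rule is to set each limit 15% below the model's nominal budget. That interpretation is an assumption, and with_headroom is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# Hypothetical helper: reduce a budget in MiB by 15% (integer arithmetic, rounds down).
with_headroom() {
    echo $(( $1 * 85 / 100 ))
}

with_headroom 8192    # nominal 8 GiB budget
with_headroom 10240   # nominal 10 GiB budget
```

For these inputs the limits come out at 6963 MiB and 8704 MiB.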

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.

How to allocate specific GPU memory limits per model
