How to allocate specific GPU memory limits per model
For running multiple models side by side with llama.cpp, Ollama, or vLLM
What this does
When several models share one GPU, each must be constrained to its own portion of VRAM. This guide covers per-instance memory budgeting using llama.cpp layer counts, Ollama environment variables, vLLM's memory-utilization fraction, and MIG partitioning.
Steps
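1. Map VRAM per model. On a 24 GB GPU with models A (needs 8 GB) and B (needs 10 GB), reserve the remaining 6 GB for KV cache and system overhead. With llama.cpp, --n-gpu-layers sets how many layers are offloaded to the GPU, not a byte limit, so the layer counts below are illustrative; tune them until each server lands near its budget:

    # Model A: allocate ~8 GB
    ./llama-server -m model-a.gguf --n-gpu-layers 32 --port 8080
    # Model B: allocate ~10 GB
    ./llama-server -m model-b.gguf --n-gpu-layers 40 --port 8081

2. Use Ollama's OLLAMA_MAX_VRAM (if supported). The value is in bytes. Ollama reads memory settings from the environment of the server process, so set the variable on ollama serve rather than on ollama run, and give each capped server its own address with OLLAMA_HOST:

    OLLAMA_MAX_VRAM=8000000000 OLLAMA_HOST=127.0.0.1:11434 ollama serve
    # In another terminal:
    OLLAMA_MAX_VRAM=10000000000 OLLAMA_HOST=127.0.0.1:11435 ollama serve

   Then start each model against its own server, for example OLLAMA_HOST=127.0.0.1:11435 ollama run model-b.

3. For vLLM, set gpu_memory_utilization per instance (--gpu-memory-utilization on the command line). The value is the fraction of total GPU memory that the instance may use for weights, activations, and KV cache:

    python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.2-3B \
      --gpu-memory-utilization 0.35 \
      --port 8000 &
    python -m vllm.entrypoints.openai.api_server \
      --model mistralai/Mistral-7B \
      --gpu-memory-utilization 0.50 \
      --port 8001 &

4. Use NVIDIA MIG (Multi-Instance GPU) for hardware-level partitioning. MIG mode must be enabled on the GPU first, and the profiles below are the ones offered on 80 GB cards; list what your GPU supports with nvidia-smi mig -lgip:

    sudo nvidia-smi -i 0 -mig 1
    sudo nvidia-smi mig -i 0 -cgi 1g.10gb,2g.20gb -C
    # Creates two GPU instances: 10 GB and 20 GB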
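The budget arithmetic from the steps above can be checked before anything is launched. The following is a minimal sketch, not part of any tool mentioned in this guide: check_budget and vllm_fraction are hypothetical helper names, and all sizes are assumed to be whole numbers in one unit (MiB in the examples).

```shell
# Hypothetical helpers for sanity-checking a VRAM budget.

# check_budget TOTAL RESERVE BUDGET...
# Prints the unallocated remainder; returns non-zero if the budgets do not fit.
check_budget() {
  local total=$1 reserve=$2
  shift 2
  local sum=0 b
  for b in "$@"; do
    sum=$((sum + b))
  done
  local left=$((total - reserve - sum))
  echo "$left"
  [ "$left" -ge 0 ]
}

# vllm_fraction BUDGET TOTAL
# Converts an absolute budget into a value usable for
# --gpu-memory-utilization, rounded down to two decimals.
vllm_fraction() {
  local pct=$(( $1 * 100 / $2 ))
  if [ "$pct" -ge 100 ]; then
    echo "1.00"
  else
    printf '0.%02d\n' "$pct"
  fi
}

# Example: 24 GB card, 6 GB reserved, models of 8 GB and 10 GB (in MiB):
#   check_budget 24576 6144 8192 10240   # prints 0
#   vllm_fraction 8192 24576             # prints 0.33
```

Rounding down rather than to nearest is a deliberate choice here, so that the converted fraction never asks for more than the absolute budget.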
Verification
nvidia-smi
# Expected: Two distinct processes listed, each consuming its allocated VRAM portion
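Reading the nvidia-smi table by eye works for two processes; for a repeatable check, the per-process query output can be filtered instead. The sketch below assumes the CSV layout produced by --query-compute-apps=pid,process_name,used_memory with noheader,nounits (values in MiB); over_budget is a hypothetical helper name.

```shell
# over_budget LIMIT_MIB
# Reads "pid, process_name, used_memory" CSV rows on stdin and prints the
# PID of every process whose memory use exceeds LIMIT_MIB.
over_budget() {
  awk -F', *' -v limit="$1" '($3 + 0) > (limit + 0) { print $1 }'
}

# Usage on a machine with an NVIDIA driver:
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory \
#              --format=csv,noheader,nounits | over_budget 10240
```

An empty result means no process is above the limit. Because each model has its own budget, run the filter once per limit or pass the largest budget as a coarse first check.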
Common failures
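- OLLAMA_MAX_VRAM not honored: Older Ollama versions ignore this variable. Upgrade to 0.5+ or use llama.cpp directly. Also confirm the variable is set in the environment of the ollama serve process; setting it only for ollama run has no effect on a server that is already running.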
- MIG not supported: MIG is available only on certain data-center GPUs, such as the A100, A30, H100, and H200; consumer GeForce cards do not support it. Use layer-based allocation instead.
- Cumulative overshoot: If one model exceeds its budget, both may OOM. Set conservative limits with 15% headroom.
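One way to guard against cumulative overshoot is to check free VRAM before starting each additional model. The sketch below interprets "15% headroom" as padding the requested budget by 15%, which is an assumption about the intended meaning; can_launch is a hypothetical helper name, and sizes are in MiB.

```shell
# can_launch FREE_MIB NEED_MIB [HEADROOM_PCT]
# Succeeds only if NEED_MIB padded by HEADROOM_PCT (default 15) fits in FREE_MIB.
can_launch() {
  local free=$1 need=$2 pct=${3:-15}
  local padded=$(( need * (100 + pct) / 100 ))
  [ "$padded" -le "$free" ]
}

# Usage on a machine with an NVIDIA driver (GPU 0, 10 GiB model):
#   free=$(nvidia-smi -i 0 --query-gpu=memory.free --format=csv,noheader,nounits)
#   can_launch "$free" 10240 || echo "not enough free VRAM, skipping launch"
```

Checking measured free memory rather than the planned budget catches the case where an earlier model has already grown past its share.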
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
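The checkpoint can be captured in a small text record so it is not lost in terminal scrollback. This is a sketch only: checkpoint_record is a hypothetical helper, the field names are illustrative, and the values in the usage line are placeholders to be replaced with your own.

```shell
# checkpoint_record RUNTIME VERSION MODEL BACKEND
# Prints a key: value record; the verification output is read from stdin
# and indented under the "verification:" key.
checkpoint_record() {
  printf 'date: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf 'runtime: %s\nversion: %s\nmodel: %s\nbackend: %s\n' "$1" "$2" "$3" "$4"
  printf 'verification:\n'
  sed 's/^/  /'
}

# Usage (placeholder values):
#   nvidia-smi | checkpoint_record llama.cpp "<build or tag>" model-a.gguf "CUDA, 24 GB" >> checkpoint.txt
```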