How to run multiple models simultaneously on the same system
Prerequisite: sufficient VRAM to hold all target models at the same time.
What this does
Running multiple models concurrently enables multi-model workflows, A/B comparison, and serving different models for different tasks. This guide covers port separation, memory budgeting, and orchestration.
Steps
Launch each model on a dedicated port.

    # Terminal 1: Coding model
    ./llama-server -m code-qwen2.5-coder.gguf --port 8080 --n-gpu-layers 40

    # Terminal 2: Chat model
    ./llama-server -m llama3.2.gguf --port 8081 --n-gpu-layers 40

Limit VRAM per instance using --n-gpu-layers. Calculate a per-model budget first. For a 24 GB GPU running two 7B Q4 models (~6 GB each):

    ./llama-server -m model1.gguf --n-gpu-layers 20 --port 8080
    ./llama-server -m model2.gguf --n-gpu-layers 20 --port 8081

For Ollama, load each model. An empty prompt loads the model and returns immediately, whereas backgrounding an interactive session with & leaves it waiting on terminal input.

    # Load model 1
    ollama run llama3.2 ""
    # Load model 2 (Ollama keeps both in memory if they fit)
    ollama run mistral ""

Verify both are responding independently.

    curl -s http://localhost:8080/completion -H "Content-Type: application/json" -d '{"prompt": "Hello from model 1"}'
    curl -s http://localhost:8081/completion -H "Content-Type: application/json" -d '{"prompt": "Hello from model 2"}'
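The budgeting step above is simple arithmetic, so it can be checked before anything is launched. The following is a minimal sketch, not part of llama.cpp or Ollama: fits_in_vram is a hypothetical helper, all sizes are whole MiB, and the 2 GiB headroom for KV cache and other processes is an assumed figure that should be replaced with a measured one.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check whether a set of model footprints fits in VRAM.
# Arguments: total VRAM (MiB), headroom to keep free (MiB), then one size per model (MiB).
fits_in_vram() {
  local total_mib=$1 headroom_mib=$2
  shift 2
  local sum=0 size
  for size in "$@"; do
    sum=$((sum + size))
  done
  local available=$((total_mib - headroom_mib))
  echo "needed=${sum} available=${available}"
  # The exit status carries the verdict: 0 when the models fit, 1 otherwise.
  [ "$sum" -le "$available" ]
}

# Example from the steps above: 24 GB GPU, two ~6 GB models, 2 GiB kept free (assumed).
if fits_in_vram 24576 2048 6144 6144; then
  echo "fits"
else
  echo "does not fit"
fi
```

If the check fails, lowering --n-gpu-layers or choosing a smaller quantization reduces the per-model footprint passed to the helper.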
Verification
nvidia-smi --query-gpu=memory.used --format=csv,noheader
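# Expected: VRAM usage is roughly the sum of both model footprints plus KV cache and runtime overhead (e.g., a little over 12 GB if each model uses 6 GB). On multi-GPU systems the command prints one line per GPU.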
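The comparison against the expected footprint can be scripted. This is a sketch under stated assumptions: total_used_mib and within_tolerance are hypothetical helpers, the input is assumed to be in the "12288 MiB" per-line form that the query above prints, and the 2048 MiB tolerance is an arbitrary illustration rather than a recommended value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read memory.used lines on stdin (one per GPU, e.g. "12288 MiB")
# and print the total across all GPUs in MiB.
total_used_mib() {
  awk '{ sum += $1 } END { print sum + 0 }'
}

# Hypothetical helper: succeed when |actual - expected| <= tolerance (all in MiB).
within_tolerance() {
  local actual=$1 expected=$2 tolerance=$3
  local diff=$((actual - expected))
  # ${diff#-} strips a leading minus sign, giving the absolute difference.
  [ "${diff#-}" -le "$tolerance" ]
}

# Demonstration with a sample line instead of a live GPU query.
printf '12288 MiB\n' | total_used_mib

# On a machine with an NVIDIA driver, the same helpers would be used as:
#   used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader | total_used_mib)
#   within_tolerance "$used" 12288 2048 && echo "VRAM usage as expected"
```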
Common failures
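- VRAM oversubscription: The models together need more VRAM than the GPU has, which leads to out-of-memory errors at load time or a severe slowdown when data spills to system memory. Reduce the offloaded layers per model or use smaller quantizations.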
- Port conflicts: Ensure each server uses a unique port. Use netstat -ano | findstr :8080 on Windows, or lsof -i :8080 on Linux and macOS, to check.
- llama-server fails to bind: Another process occupies the port. Use --port 0 for auto-assignment, then check the logs for the port that was chosen.
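Port conflicts inside a launch plan can be caught before any server starts. The sketch below uses two hypothetical helpers, duplicate_ports and assign_ports, that only inspect or generate the plan; they do not test whether a port is already taken by another process on the host, which still needs the netstat check above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every port that appears more than once in the arguments.
duplicate_ports() {
  printf '%s\n' "$@" | sort -n | uniq -d
}

# Hypothetical helper: pair each model file with a consecutive port, starting at a base port.
assign_ports() {
  local port=$1
  shift
  local model
  for model in "$@"; do
    echo "${port} ${model}"
    port=$((port + 1))
  done
}

# Example plan using the model files from the steps above.
assign_ports 8080 code-qwen2.5-coder.gguf llama3.2.gguf
```

Each output line can then be fed to a launch command of the form ./llama-server -m MODEL --port PORT.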
Operator checkpoint
Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.