What this does

Running multiple models concurrently enables multi-model workflows, A/B comparison, and serving different models for different tasks. This guide covers port separation, memory budgeting, and orchestration.

Steps

Launch each model on a dedicated port.

# Terminal 1: Coding model
./llama-server -m code-qwen2.5-coder.gguf --port 8080 --n-gpu-layers 40

# Terminal 2: Chat model
./llama-server -m llama3.2.gguf --port 8081 --n-gpu-layers 40
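
The two terminals above can also be driven from a single script. The following is a minimal sketch, not a definitive launcher: `launch_model`, `stop_models`, `LLAMA_SERVER`, and `LOG_DIR` are hypothetical names introduced here for illustration, and only the `-m`, `--port`, and `--n-gpu-layers` flags come from the commands above.

```bash
#!/usr/bin/env bash
# Assumptions: the server binary path and log directory are configurable
# through these two variables; neither is defined by llama.cpp itself.
LLAMA_SERVER="${LLAMA_SERVER:-./llama-server}"
LOG_DIR="${LOG_DIR:-.}"

# Start one server in the background, send its output to a per-model log,
# and record its PID so it can be stopped later.
launch_model() {
  local name=$1 model=$2 port=$3 layers=$4
  "$LLAMA_SERVER" -m "$model" --port "$port" --n-gpu-layers "$layers" \
    > "${LOG_DIR}/${name}.log" 2>&1 &
  echo "$!" > "${LOG_DIR}/${name}.llama.pid"
  echo "started ${name} on port ${port}"
}

# Stop every server that launch_model recorded in LOG_DIR.
stop_models() {
  local f
  for f in "${LOG_DIR}"/*.llama.pid; do
    [ -f "$f" ] || continue
    kill "$(cat "$f")" 2>/dev/null
    rm -f "$f"
  done
}

# Usage, mirroring the two terminals above:
#   launch_model coder code-qwen2.5-coder.gguf 8080 40
#   launch_model chat  llama3.2.gguf           8081 40
#   stop_models
```

Keeping one log and one PID file per model makes it easier to tell which instance failed when both start at once.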

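Budget VRAM per instance with --n-gpu-layers, which sets how many model layers are offloaded to the GPU; the remaining layers run on the CPU. Estimate each model's footprint as its weights plus KV cache and runtime overhead. On a 24 GB GPU, two 7B Q4 models (~6 GB each) fit fully offloaded at about 12 GB combined, so the value of 20 below is a conservative partial offload for a typical 7B model. Lower the count further only when the combined footprint approaches total VRAM: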
```
./llama-server -m model1.gguf --n-gpu-layers 20 --port 8080
./llama-server -m model2.gguf --n-gpu-layers 20 --port 8081
```
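
The per-model budget can be worked out with a few lines of shell arithmetic. This is a rough sketch under loud assumptions: it treats every layer as the same size, ignores KV cache growth with context length, and uses a hypothetical helper name (`budget_layers`) and an arbitrary 2 GB reserve. Treat the result as a starting point and confirm it with nvidia-smi.

```bash
#!/usr/bin/env bash
# budget_layers TOTAL_MIB RESERVE_MIB NUM_MODELS MODEL_MIB MODEL_LAYERS
# Splits the usable VRAM evenly across models and estimates how many
# layers of one model fit in its share (assumes equal-sized layers).
budget_layers() {
  local total=$1 reserve=$2 count=$3 model_mib=$4 model_layers=$5
  local share=$(( (total - reserve) / count ))
  local layers=$(( share * model_layers / model_mib ))
  # Never report more layers than the model has.
  if [ "$layers" -gt "$model_layers" ]; then
    layers=$model_layers
  fi
  echo "$layers"
}

# 24 GB GPU, 2 GB reserve, two models of ~6 GB with 32 layers each:
budget_layers 24576 2048 2 6144 32   # prints 32 (full offload fits)

# Same models on a 12 GB GPU:
budget_layers 12288 2048 2 6144 32   # prints 26
```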

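For Ollama, a single server process loads multiple models on demand, so no separate ports are needed. How many stay resident depends on available memory and the OLLAMA_MAX_LOADED_MODELS setting. Preload each model with an empty prompt.

# Load model 1 (an empty prompt loads the model and exits)
ollama run llama3.2 ""
# Load model 2 (Ollama keeps both in memory while they fit and have not timed out)
ollama run mistral ""
# Confirm which models are currently loaded
ollama ps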
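
Models can also be preloaded through Ollama's HTTP API instead of the CLI. The sketch below assumes the server listens on its default port 11434 and uses a hypothetical helper, `preload_body`, to build the request; a request to /api/generate that names a model without a prompt loads it, and a keep_alive of -1 asks Ollama to keep it loaded until it is explicitly unloaded.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the JSON body for a preload request.
# keep_alive -1 requests that the model stay in memory indefinitely.
preload_body() {
  printf '{"model": "%s", "keep_alive": -1}' "$1"
}

# Usage (requires a running Ollama server):
#   curl -s http://localhost:11434/api/generate -d "$(preload_body llama3.2)"
#   curl -s http://localhost:11434/api/generate -d "$(preload_body mistral)"
#   ollama ps   # lists the models currently loaded
```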

Verify both are responding independently.

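curl -s http://localhost:8080/completion -H "Content-Type: application/json" -d '{"prompt": "Hello from model 1", "n_predict": 32}'
curl -s http://localhost:8081/completion -H "Content-Type: application/json" -d '{"prompt": "Hello from model 2", "n_predict": 32}'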
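
Before sending prompts, it can help to confirm that each server has finished loading. This sketch polls llama-server's /health endpoint on each port; `check_server` and `check_all` are hypothetical helper names, and the 5-second timeout is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Report whether the server on one port answers its health endpoint.
# curl -f turns HTTP errors (for example, a model still loading) into
# a non-zero exit status.
check_server() {
  local port=$1
  if curl -sf --max-time 5 "http://localhost:${port}/health" >/dev/null; then
    echo "port ${port}: ok"
  else
    echo "port ${port}: not responding"
    return 1
  fi
}

# Check several ports; return non-zero if any of them failed.
check_all() {
  local failed=0 port
  for port in "$@"; do
    check_server "$port" || failed=1
  done
  return "$failed"
}

# Usage:
#   check_all 8080 8081
```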

Verification

nvidia-smi --query-gpu=memory.used --format=csv,noheader
# Expected: VRAM usage roughly equals the sum of both model footprints plus per-process overhead (e.g., a little over 12 GB if each uses 6 GB)
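
To compare the measured usage against the expected sum, the per-process figures can be added up and checked against a tolerance. This is a sketch with hypothetical helper names (`sum_used_mib`, `vram_matches`) and an arbitrary tolerance; it assumes nvidia-smi's per-process query, which reports one line per compute process.

```bash
#!/usr/bin/env bash
# Sum one number per input line (MiB values from nvidia-smi).
sum_used_mib() {
  awk '{ total += $1 } END { print total + 0 }'
}

# vram_matches ACTUAL_MIB EXPECTED_MIB TOLERANCE_MIB
# Succeeds when the measured usage is within the tolerance of the estimate.
vram_matches() {
  local actual=$1 expected=$2 tolerance=$3
  local diff=$(( actual - expected ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( 0 - diff ))
  fi
  [ "$diff" -le "$tolerance" ]
}

# Usage (requires an NVIDIA driver):
#   used=$(nvidia-smi --query-compute-apps=used_memory \
#            --format=csv,noheader,nounits | sum_used_mib)
#   vram_matches "$used" 12288 1024 && echo "within budget estimate"
```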

Common failures

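VRAM oversubscription: The combined models exceed VRAM, which causes allocation failures at load time or, where the driver spills into system memory, a severe slowdown. Reduce --n-gpu-layers per model or use smaller quantizations.
Port conflicts: Ensure each server uses a unique port. To see what holds a port, use lsof -i :8080 on Linux or macOS, or netstat -ano | findstr :8080 on Windows.
llama-server fails to bind: Another process occupies the port. Stop that process or pass a different --port. If your build accepts --port 0 for OS-assigned ports, read the actual port from the startup log.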
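
Port conflicts can be caught before launching a server. The sketch below probes localhost with bash's /dev/tcp redirection, so it assumes bash rather than a plain POSIX shell; `port_in_use` and `find_free_port` are hypothetical helper names, and the 100-port search limit is arbitrary.

```bash
#!/usr/bin/env bash
# Succeeds when something accepts TCP connections on the given local port.
# The probe runs in a subshell so the test descriptor is closed on exit.
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Print the first free port at or above the starting port,
# giving up after 100 candidates.
find_free_port() {
  local port=$1 tries=0
  while [ "$tries" -lt 100 ]; do
    if ! port_in_use "$port"; then
      echo "$port"
      return 0
    fi
    port=$(( port + 1 ))
    tries=$(( tries + 1 ))
  done
  return 1
}

# Usage:
#   port=$(find_free_port 8080)
#   ./llama-server -m model1.gguf --n-gpu-layers 20 --port "$port"
```

A port found this way can still be taken by another process before the server binds it, so the startup log remains the authority.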

Operator checkpoint

Before treating this as solved, write down the local runtime, model or package version, hardware/backend if relevant, and the verification output. This keeps the guide useful as a Will-It-Run style decision instead of a one-off command transcript.
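
The checkpoint can be captured with a small script rather than by hand. This is a sketch only: `record_line` is a hypothetical helper, `checkpoint.txt` is an arbitrary file name, and the listed commands assume the tools used earlier in this guide are installed.

```bash
#!/usr/bin/env bash
# record_line LABEL COMMAND [ARGS...]
# Prints "LABEL: <first line of the command's output>", or
# "LABEL: unavailable" when the command is not installed.
record_line() {
  local label=$1
  shift
  if command -v "$1" >/dev/null 2>&1; then
    printf '%s: %s\n' "$label" "$("$@" 2>&1 | head -n 1)"
  else
    printf '%s: unavailable\n' "$label"
  fi
}

# Usage:
#   {
#     date
#     record_line llama-server ./llama-server --version
#     record_line ollama ollama --version
#     record_line gpu nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
#     record_line vram-used nvidia-smi --query-gpu=memory.used --format=csv,noheader
#   } > checkpoint.txt
```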
