18. Multi-GPU Setup
Chapter 18 of 20 · 20 min
Multi-GPU configurations add VRAM and compute capacity, but they also add cost and complexity. Understanding scaling efficiency helps you decide whether the extra hardware is worth it.
Scaling Efficiency
Multi-GPU inference efficiency varies by method:
| Configuration | Theoretical Speedup | Typical Measured | Cause of Loss |
|---|---|---|---|
| 2x same model | 1.85x | 1.7-1.8x | PCIe bottleneck |
| 4x same model | 3.5x | 2.5-3.0x | Communication overhead |
| Tensor parallel | 1.9x | 1.6-1.8x | All-reduce operations |
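Tensor parallelism (splitting each layer's weight matrices across GPUs) has higher communication overhead than pipeline parallelism (assigning whole layers to different GPUs): tensor parallelism needs an all-reduce across GPUs inside every layer, while pipeline parallelism only passes activations between GPUs at stage boundaries.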
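The table's measured figures can be turned into a scaling efficiency, meaning the measured speedup divided by the GPU count. The following is a minimal sketch; the function name `parallel_efficiency` is hypothetical, and the inputs are the low ends of the measured ranges in the table above.

```shell
#!/usr/bin/env bash
# parallel_efficiency MEASURED_SPEEDUP GPU_COUNT
# Prints the efficiency as a percentage of ideal linear scaling.
parallel_efficiency() {
  awk -v s="$1" -v n="$2" 'BEGIN { printf "%.1f\n", 100 * s / n }'
}

parallel_efficiency 1.7 2   # 2 GPUs, low end of the measured range
parallel_efficiency 2.5 4   # 4 GPUs, low end of the measured range
```

With these inputs the two calls print 85.0 and 62.5, which shows how quickly the return per added GPU falls off between two and four cards.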
Hardware Requirements
Minimum for 2-GPU setup:
| Component | Requirement |
|---|---|
| CPU PCIe lanes | x16 per slot is ideal (32 lanes); CPUs with 20-24 lanes run two GPUs at x8/x8 |
| Motherboard | Must support PCIe bifurcation |
| PSU | 1200W+ (dual 450W GPUs plus system) |
| Case | Full-tower with 4+ PCIe slots visible |
| Cooling | 6+ case fans, or liquid cooling |
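The PSU figure in the table can be reproduced with simple arithmetic. This is a rough sketch, not a sizing rule: the 200 W system draw and the 10% headroom are assumptions chosen for illustration, and `psu_estimate` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# psu_estimate GPU_COUNT GPU_WATTS SYSTEM_WATTS HEADROOM_PERCENT
# Prints a suggested PSU rating in watts (integer arithmetic only).
psu_estimate() {
  local gpus=$1 gpu_w=$2 system_w=$3 headroom_pct=$4
  echo $(( (gpus * gpu_w + system_w) * (100 + headroom_pct) / 100 ))
}

# Two 450 W GPUs, an assumed 200 W for CPU/board/drives, assumed 10% headroom
psu_estimate 2 450 200 10
```

Under these assumptions the estimate is 1210 W, in line with the 1200W+ recommendation. Transient power spikes on high-end GPUs may justify more headroom than assumed here.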
PCIe Topology
# Verify PCIe topology on Linux (-v adds device names to the tree)
lspci -tv
# Simplified example output for a dual-GPU system:
# ┌─[0000:00]─[0001:00]─[0002:00]─[0002:01]─[0002:02] NVIDIA Tesla
# │ ─[0002:03] NVIDIA Tesla
# └─[0001:01]─[0001:01] NVMe storage
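Both GPUs should ideally run at PCIe 4.0 x16. Check the negotiated link generation and width via:
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv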
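The link check can be automated so that a GPU running below its maximum link is flagged. The sketch below assumes the `nvidia-smi --query-gpu` fields `pcie.link.gen.current`, `pcie.link.gen.max`, `pcie.link.width.current`, and `pcie.link.width.max`; the helper name `check_links` is hypothetical.

```shell
#!/usr/bin/env bash
# check_links: reads CSV rows "index, gen_cur, gen_max, width_cur, width_max"
# on stdin and reports whether each GPU runs at its maximum link.
check_links() {
  awk -F', *' '
    {
      status = ($2 == $3 && $4 == $5) ? "OK" : "REDUCED"
      printf "GPU %s: gen %s/%s, width x%s/x%s -> %s\n", $1, $2, $3, $4, $5, status
    }'
}

# Only query the driver when nvidia-smi is actually installed
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi \
    --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
    --format=csv,noheader | check_links
fi
```

An idle GPU can drop to a lower link generation to save power, so a REDUCED result is most meaningful when the check runs while the GPU is under load.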
llama.cpp Multi-GPU Configuration
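# llama.cpp with multiple GPUs: offload all layers and split them across both cards
./llama-server \
-m models/llama-3-70b-instruct-q4_k_m.gguf \
-ngl 999 \
--split-mode layer \
--tensor-split 1,1 \
-t 16 \
-c 4096
# --split-mode layer (the default) assigns whole layers to each visible GPU
# --tensor-split 1,1 divides the model evenly; adjust the ratio for GPUs with unequal VRAM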
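When the GPUs have different amounts of VRAM, the layer split can be weighted by memory. The sketch below builds a value for llama.cpp's `--tensor-split` option from per-GPU VRAM sizes; it assumes the option accepts a comma-separated list of proportions, and `tensor_split_from_mib` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# tensor_split_from_mib: reads one VRAM size in MiB per line on stdin and
# prints a comma-separated list of sizes rounded to whole GiB.
tensor_split_from_mib() {
  awk '{ printf "%s%d", (NR > 1 ? "," : ""), int($1 / 1024 + 0.5) } END { print "" }'
}

# Example with fixed values: a 24 GiB card next to a 12 GiB card
printf '24564\n12288\n' | tensor_split_from_mib

# On a real system the sizes could come from the driver, for example:
#   nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | tensor_split_from_mib
# and the result passed as: --tensor-split "<output>"
```

The fixed-value example prints 24,12, which would place roughly two thirds of the model on the larger card.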
Alternative: One Model per GPU
Load different models on each GPU for parallel serving:
# GPU 0: Llama 3 8B on port 8080
CUDA_VISIBLE_DEVICES=0 ./llama-server -m models/llama-3-8b-q4_k_m.gguf -ngl 999 --port 8080 -c 2048 &
# GPU 1: Mistral 7B on port 8081
CUDA_VISIBLE_DEVICES=1 ./llama-server -m models/mistral-7b-q4_k_m.gguf -ngl 999 --port 8081 -c 2048 &
# Clients select a model (and therefore a GPU) by port
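Port-based routing can be sketched as a small client-side helper. Everything named here is an assumption for illustration: the ports 8080 and 8081, the model keys `llama` and `mistral`, and the helper names `port_for_model` and `ask`. The request goes to the OpenAI-compatible `/v1/chat/completions` endpoint that `llama-server` exposes.

```shell
#!/usr/bin/env bash
# port_for_model MODEL_KEY: maps a model key to the port of its server.
port_for_model() {
  case "$1" in
    llama)   echo 8080 ;;   # assumed port of the server pinned to GPU 0
    mistral) echo 8081 ;;   # assumed port of the server pinned to GPU 1
    *)       echo "unknown model: $1" >&2; return 1 ;;
  esac
}

# ask MODEL_KEY PROMPT: sends a single-turn chat request to the matching server.
# Illustrative only: the prompt must not contain double quotes or backslashes.
ask() {
  local model=$1 prompt=$2 port
  port=$(port_for_model "$model") || return 1
  curl -s "http://localhost:${port}/v1/chat/completions" \
    -H 'Content-Type: application/json' \
    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"${prompt}\"}]}"
}
```

A call such as `ask mistral "Summarize PCIe bifurcation in one sentence"` would then reach only the server on the second GPU, leaving the first free for other requests.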
EXERCISE
Calculate the cost difference between a single RTX 4090 (24 GB) configuration and a dual RTX 3090 (2 × 24 GB) configuration. Compare their performance when running Llama 3 70B at INT4.
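As a starting point for the performance comparison, it helps to check whether the weights fit in VRAM at all. The sketch below estimates weight size as parameters times bits per weight divided by 8. It deliberately ignores the KV cache and runtime overhead, so real requirements are higher; the helper names are hypothetical.

```shell
#!/usr/bin/env bash
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Prints the approximate size of the weights in GB.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# fits_in_vram PARAMS_BILLIONS BITS_PER_WEIGHT VRAM_GB
# Prints "yes" if the weights alone fit in the given VRAM, otherwise "no".
fits_in_vram() {
  awk -v p="$1" -v b="$2" -v v="$3" 'BEGIN { print ((p * b / 8 <= v) ? "yes" : "no") }'
}

weights_gb 70 4        # 70B parameters at 4 bits per weight
fits_in_vram 70 4 24   # single 24 GB card
fits_in_vram 70 4 48   # two 24 GB cards combined
```

The weights alone come to about 35 GB, so the single-card configuration has to offload part of the model to system RAM, which is the main factor to account for when comparing throughput.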