Multi-GPU Setup — Hardware Planning for Local AI (Chapter 18)

Multi-GPU configurations multiply VRAM and compute capacity but add complexity. Understanding scaling efficiency helps justify the added cost.

Scaling Efficiency

Multi-GPU inference efficiency varies by method:

GPUs	Theoretical Speedup	Typical Measured	Cause
2x same model	1.85x	1.7-1.8x	PCIe bottleneck
4x same model	3.5x	2.5-3.0x	Communication overhead
Tensor parallel	1.9x	1.6-1.8x	All-reduce operations

Tensor parallelism (splitting a single model across GPUs) has higher overhead than pipeline parallelism (splitting layers across GPUs).

Hardware Requirements

Minimum for 2-GPU setup:

Component	Requirement
CPU PCIe lanes	16 per slot (20+ total for 2 GPUs)
Motherboard	Must support PCIe bifurcation
PSU	1200W+ (dual 450W GPUs plus system)
Case	Full-tower with 4+ PCIe slots visible
Cooling	6+ case fans, or liquid cooling

PCIe Topology

# Verify PCIe topology on Linux
lspci -t

# Example output for dual RTX 4090:
# ┌─[0000:00]─[0001:00]─[0002:00]─[0002:01]─[0002:02] NVIDIA Tesla
# │                         ─[0002:03] NVIDIA Tesla  
# └─[0001:01]─[0001:01] NVMe storage

Both GPUs should be at PCIe 4.0 x16. Check via:

nvidia-smi -q -i 0,1 -x | grep -E "Link.*Current|Link.*Max"

llama.cpp Multi-GPU Configuration

# llama.cpp with multiple GPUs
./llama-server \
    -m models/llama-3-70b-instruct-q4_k_m.gguf \
    -ngl 999 \
    -t 16 \
    -c 4096

# Internal splitting for larger models
# Model layers divided across available GPUs

sharded Weights Alternative

Load different models on each GPU for parallel serving:

# GPU 0: Llama 3 13B
./llama-server -m models/llama-3-13b-q4_k_m.gguf -ngl 999 -po 0 -c 2048 &

# GPU 1: Mistral 7B
./llama-server -m models/mistral-7b-q4_k_m.gguf -ngl 999 -po 1 -c 2048 &

# Routes requests to appropriate GPU based on port