RUNLOCALAIv38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
Hardware Planning for Local AI

18. Multi-GPU Setup

Chapter 18 of 20 · 20 min
KEY INSIGHT

Multi-GPU setups rarely achieve linear scaling because of communication overhead. Evaluate whether the cost difference versus a single high-end GPU justifies the complexity.

```bash
# Monitor both GPUs during parallel inference
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
    --format=csv

# Split a model across GPUs by layer count
# 80 layers on 2 GPUs = 40 layers each
# Total VRAM = GPU_VRAM * 2 (minus overhead)
```

Multi-GPU configurations multiply VRAM and compute capacity but add complexity. Understanding scaling efficiency helps justify the added cost.

Scaling Efficiency

Multi-GPU inference efficiency varies by method:

| GPUs | Theoretical Speedup | Typical Measured | Cause |
| --- | --- | --- | --- |
| 2x same model | 1.85x | 1.7-1.8x | PCIe bottleneck |
| 4x same model | 3.5x | 2.5-3.0x | Communication overhead |
| Tensor parallel | 1.9x | 1.6-1.8x | All-reduce operations |

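Tensor parallelism (splitting each layer's weights across GPUs) has higher overhead than pipeline parallelism (assigning whole layers to each GPU). Tensor parallelism needs an all-reduce between GPUs inside every layer, while pipeline parallelism only passes activations across the boundary between one GPU's layers and the next.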
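To compare rows in the scaling table on equal terms, it can help to convert a measured speedup into per-GPU efficiency (speedup divided by GPU count). The sketch below is only illustrative; `scaling_efficiency` is a hypothetical helper name, and the inputs are the "Typical Measured" figures from the table above.

```bash
# scaling_efficiency SPEEDUP GPU_COUNT
# Prints per-GPU efficiency as a percentage with one decimal place.
scaling_efficiency() {
    awk -v s="$1" -v n="$2" 'BEGIN { printf "%.1f\n", (s / n) * 100 }'
}

scaling_efficiency 1.7 2   # low end of the 2-GPU row
scaling_efficiency 3.0 4   # high end of the 4-GPU row
```

With those inputs the 2-GPU case works out to 85% per-GPU efficiency and the best 4-GPU case to 75%, which is the pattern the table describes: each added GPU contributes less than the one before.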

Hardware Requirements

Minimum for 2-GPU setup:

| Component | Requirement |
| --- | --- |
| CPU PCIe lanes | x16 per slot is ideal (32 lanes total); CPUs with only 20+ lanes run two GPUs at x8/x8 |
| Motherboard | Must support PCIe bifurcation |
| PSU | 1200W+ (dual 450W GPUs plus system) |
| Case | Full-tower with 4+ PCIe slots visible |
| Cooling | 6+ case fans, or liquid cooling |
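The PSU figure can be sanity-checked with simple arithmetic. The sketch below is a rough estimate, not a sizing rule: `psu_watts` is a hypothetical helper, and the 150W system draw and 15% headroom are assumptions chosen for illustration, not values from this chapter.

```bash
# psu_watts GPU_COUNT GPU_WATTS SYSTEM_WATTS HEADROOM_PERCENT
# Prints a suggested PSU rating in watts, rounded up.
psu_watts() {
    local gpus=$1 gpu_w=$2 system_w=$3 headroom=$4
    local load=$(( gpus * gpu_w + system_w ))
    # Integer math: apply headroom, rounding up to the next whole watt.
    echo $(( (load * (100 + headroom) + 99) / 100 ))
}

# Two 450W GPUs, assumed 150W for the rest of the system, assumed 15% headroom
psu_watts 2 450 150 15
```

Under those assumptions the result is 1208W, which lines up with the 1200W+ requirement listed above. Transient power spikes on high-end GPUs are one reason to size upward rather than downward.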

PCIe Topology

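# Verify PCIe topology on Linux (-v adds device names to the tree)
lspci -tv

# Simplified example for dual RTX 4090 (real output is formatted differently):
# ┌─[0000:00]─[0001:00]─[0002:00]─[0002:01]─[0002:02] NVIDIA RTX 4090
# │                                         ─[0002:03] NVIDIA RTX 4090
# └─[0001:01]─[0001:01] NVMe storage

# Show how the GPUs connect to each other and to the CPU
nvidia-smi topo -m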

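Both GPUs should ideally negotiate PCIe 4.0 x16. Check the current and maximum link generation and width for each GPU:

nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
    --format=csv

Idle GPUs often drop to a lower link generation to save power, so run the check while a model is loaded and generating.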
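A link check like this is easier to read if each GPU is flagged when it runs below its maximum. The sketch below assumes input in the CSV shape produced by `nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max --format=csv,noheader`; `check_pcie_links` is a hypothetical helper name, and the sample rows fed to it are made up for illustration.

```bash
# check_pcie_links: reads "index, gen_cur, gen_max, width_cur, width_max"
# rows on stdin and marks any GPU running below its maximum link.
check_pcie_links() {
    awk -F', *' '{
        status = ($2 + 0 < $3 + 0 || $4 + 0 < $5 + 0) ? "DEGRADED" : "OK"
        printf "GPU %s: gen %s/%s, x%s/x%s %s\n", $1, $2, $3, $4, $5, status
    }'
}

# Real usage (requires an NVIDIA driver):
#   nvidia-smi --query-gpu=index,pcie.link.gen.current,pcie.link.gen.max,pcie.link.width.current,pcie.link.width.max \
#       --format=csv,noheader | check_pcie_links

# Hypothetical sample: GPU 1 has negotiated x8 instead of x16
printf '0, 4, 4, 16, 16\n1, 4, 4, 8, 16\n' | check_pcie_links
```

A DEGRADED result on an idle card may only reflect power saving, so it is worth repeating the check under load before reseating cards or changing BIOS settings.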

llama.cpp Multi-GPU Configuration

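# llama.cpp with multiple GPUs: offload all layers and split them across cards
./llama-server \
    -m models/llama-3-70b-instruct-q4_k_m.gguf \
    -ngl 999 \
    --split-mode layer \
    --tensor-split 1,1 \
    -t 16 \
    -c 4096

# --split-mode layer assigns whole layers to each GPU (this is the default)
# --tensor-split 1,1 gives each of the two GPUs an equal share of the model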
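The layer division can be worked out by hand when the GPUs differ in size. The sketch below splits a layer count in proportion to each card's VRAM; `layer_split` is a hypothetical helper for illustration only, and llama.cpp's own assignment (controlled by its `--tensor-split` proportions) may not match these numbers exactly.

```bash
# layer_split TOTAL_LAYERS VRAM_GB [VRAM_GB ...]
# Prints how many layers each GPU would hold if layers were divided
# in proportion to VRAM. The last GPU takes whatever remains.
layer_split() {
    local total=$1; shift
    local sum=0 v
    for v in "$@"; do sum=$(( sum + v )); done
    local assigned=0 i=0 n=$# share out=""
    for v in "$@"; do
        i=$(( i + 1 ))
        if [ "$i" -eq "$n" ]; then
            share=$(( total - assigned ))
        else
            share=$(( total * v / sum ))
        fi
        assigned=$(( assigned + share ))
        out="$out${out:+ }$share"
    done
    echo "$out"
}

layer_split 80 24 24   # two equal 24GB cards
layer_split 80 24 12   # mismatched 24GB and 12GB cards
```

Two equal cards give 40 layers each, matching the 80-layer example in the key insight. The mismatched pair gives 53 and 27, which shows why unequal cards need an explicit split ratio rather than an even division.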

Sharded Weights Alternative

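Load a different model on each GPU for parallel serving. Pin each server to one GPU with CUDA_VISIBLE_DEVICES and give it its own port:

# GPU 0: Llama 3 8B on port 8080
CUDA_VISIBLE_DEVICES=0 ./llama-server -m models/llama-3-8b-q4_k_m.gguf -ngl 999 --port 8080 -c 2048 &

# GPU 1: Mistral 7B on port 8081
CUDA_VISIBLE_DEVICES=1 ./llama-server -m models/mistral-7b-q4_k_m.gguf -ngl 999 --port 8081 -c 2048 &

# Clients choose a model (and therefore a GPU) by sending requests to the matching port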
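Routing by port can be as simple as a lookup from model name to port. The sketch below assumes the two servers listen on ports 8080 and 8081 on localhost (an assumption, adjust to your setup); `port_for_model` and `ask` are hypothetical helper names. It uses llama-server's `/completion` endpoint and does no JSON escaping, so it only suits simple prompts without quotes.

```bash
# port_for_model MODEL: map a short model name to its server port.
# The names and ports here are assumptions for illustration.
port_for_model() {
    case "$1" in
        llama)   echo 8080 ;;
        mistral) echo 8081 ;;
        *)       echo "unknown model: $1" >&2; return 1 ;;
    esac
}

# ask MODEL PROMPT: send a completion request to the server holding MODEL.
ask() {
    local port
    port=$(port_for_model "$1") || return 1
    curl -s "http://127.0.0.1:${port}/completion" \
        -H 'Content-Type: application/json' \
        -d "{\"prompt\": \"$2\", \"n_predict\": 64}"
}
```

With both servers running, `ask mistral "Summarize PCIe lanes in one sentence"` would go to the second GPU while `ask llama ...` goes to the first. For anything beyond a quick test, a reverse proxy in front of the two ports is the more usual design.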
EXERCISE

Calculate the cost difference between a single RTX 4090 24GB configuration and a dual-RTX 3090 24GB configuration. Compare performance for running Llama 3 70B INT4.
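One starting point for the exercise is checking whether the model fits at all. The sketch below estimates weight memory as parameters times bits per weight divided by 8, using 1 GB = 10^9 bytes. `weights_gb` and `fits_in_vram` are hypothetical helpers, and the result is a lower bound: it ignores the KV cache and runtime overhead, and real 4-bit GGUF quantizations typically use somewhat more than 4 bits per weight.

```bash
# weights_gb PARAMS_BILLIONS BITS_PER_WEIGHT
# Prints the approximate size of the weights in GB.
weights_gb() {
    awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

# fits_in_vram PARAMS_BILLIONS BITS_PER_WEIGHT TOTAL_VRAM_GB
# Succeeds (exit 0) when the weights alone fit in the given VRAM.
fits_in_vram() {
    awk -v p="$1" -v b="$2" -v v="$3" 'BEGIN { exit !(p * b / 8 <= v) }'
}

weights_gb 70 4
fits_in_vram 70 4 24 && echo "single 24GB card: fits" || echo "single 24GB card: does not fit"
fits_in_vram 70 4 48 && echo "dual 24GB cards: fits"  || echo "dual 24GB cards: does not fit"
```

At a nominal 4 bits the 70B weights come to about 35 GB, so a single 24GB card would have to offload part of the model to system RAM, while two 24GB cards can hold it entirely in VRAM. That difference, more than raw compute, is likely to dominate the performance comparison.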
