RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
Custom Training Pipelines

06. Multi-GPU Training

Chapter 6 of 18 · 20 min
KEY INSIGHT

Data-parallel training multiplies the effective batch size by the GPU count. Scale the learning rate to match (linear scaling is the usual starting point) or risk convergence failures.

Multi-GPU training shortens wall-clock time but adds complexity that can break single-GPU code in subtle ways. Knowing the common failure modes up front saves debugging time later.

Choosing a Strategy

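| Strategy          | Use When                         | Not When                                       |
|-------------------|----------------------------------|------------------------------------------------|
| Data Parallelism  | Fast GPUs, small models          | Large models that don't fit on one GPU         |
| Model Parallelism | Very large models                | Small models (overhead dominates)              |
| FSDP              | Large models, good interconnects | Slow interconnects (e.g., older AWS instances) |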
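The table can be read as a small decision procedure. The sketch below is one hedged interpretation of it, not a benchmark-backed rule: `choose_strategy` and its parameters are hypothetical names, and the single "does it fit on one GPU" test is an assumption that leaves the caller to account for activations, gradients, and optimizer state.

```python
def choose_strategy(train_mem_gb: float, gpu_mem_gb: float, fast_interconnect: bool) -> str:
    """Pick a strategy following the table above (rule of thumb only).

    train_mem_gb: estimated per-replica training footprint (assumed to be
                  supplied by the caller; this sketch does not compute it).
    gpu_mem_gb:   memory available on a single GPU.
    """
    # If a full replica fits on one GPU, plain data parallelism is the simplest option.
    if train_mem_gb <= gpu_mem_gb:
        return "data parallelism"
    # The model does not fit: FSDP shards it, but relies on a fast interconnect.
    if fast_interconnect:
        return "FSDP"
    # Otherwise fall back to splitting the model across devices.
    return "model parallelism"


print(choose_strategy(train_mem_gb=12, gpu_mem_gb=24, fast_interconnect=False))
print(choose_strategy(train_mem_gb=80, gpu_mem_gb=24, fast_interconnect=True))
```

Treat the output as a starting point for measurement; real choices also depend on throughput targets and how the model can be partitioned.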

Setting Up Distributed Training

import os

import torch
import torch.distributed as dist

def setup_distributed():
    """Initialize process group for distributed training."""
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    
    dist.init_process_group(
        backend="nccl",           # NVIDIA GPUs: always use NCCL
        init_method="env://",
        world_size=int(os.environ["WORLD_SIZE"]),
        rank=int(os.environ["RANK"])
    )
    
    return local_rank

def cleanup_distributed():
    """Clean up process group."""
    dist.destroy_process_group()

Launching Distributed Jobs

# Single node, 4 GPUs
torchrun --nproc_per_node=4 train.py

# Multi-node (2 nodes, 4 GPUs each)
# Run this on every node; use --node_rank=0 on the first node and --node_rank=1 on the second
torchrun \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="10.0.0.1" \
    --master_port=29500 \
    train.py
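To make the relationship between the launch flags and the ranks each process sees more concrete, here is a minimal sketch. It assumes a static launch like the one above, where global ranks are typically assigned contiguously per node; elastic rendezvous setups may assign them differently. `global_rank` and `rank_layout` are hypothetical helper names, not part of torchrun.

```python
def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank under the assumed contiguous per-node assignment."""
    return node_rank * nproc_per_node + local_rank


def rank_layout(nnodes: int, nproc_per_node: int) -> dict:
    """Map (node_rank, local_rank) -> global rank for every process in the job."""
    return {
        (node, local): global_rank(node, nproc_per_node, local)
        for node in range(nnodes)
        for local in range(nproc_per_node)
    }


layout = rank_layout(nnodes=2, nproc_per_node=4)
world_size = len(layout)          # total number of processes across all nodes
print(world_size)                 # 8
print(layout[(1, 0)])             # first GPU on the second node -> 4
```

Under this assumption, LOCAL_RANK selects the GPU on a node, while RANK and WORLD_SIZE describe the process within the whole job.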

Common Failure: Batch Size Confusion

In data-parallel training, each GPU processes its own batch, so the effective batch size is batch_size * num_gpus. A config with batch size 32 trained on 4 GPUs uses an effective batch of 128. Scale the learning rate accordingly; linear scaling works for most cases:

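# Proper scaling: count every process in the job, not just the GPUs on this node
world_size = dist.get_world_size()  # call after init_process_group()
effective_batch_size = config.batch_size * world_size
# config.base_batch_size is the batch size that config.base_lr was tuned for
scaled_lr = config.base_lr * (effective_batch_size / config.base_batch_size)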
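As a worked example of the linear scaling rule, the sketch below plugs in the numbers from this section (batch size 32 per GPU, 4 GPUs). The function name `scale_lr` and the base learning rate of 1e-3 are illustrative assumptions, not values from the chapter.

```python
import math


def scale_lr(base_lr: float, base_batch_size: int, per_gpu_batch_size: int, world_size: int) -> float:
    """Linear scaling: grow the learning rate in proportion to the effective batch size."""
    effective_batch_size = per_gpu_batch_size * world_size
    return base_lr * (effective_batch_size / base_batch_size)


# Assumed baseline: lr 1e-3 tuned on a single GPU with batch size 32.
single_gpu_lr = scale_lr(base_lr=1e-3, base_batch_size=32, per_gpu_batch_size=32, world_size=1)
four_gpu_lr = scale_lr(base_lr=1e-3, base_batch_size=32, per_gpu_batch_size=32, world_size=4)

print(single_gpu_lr)  # unchanged on one GPU
print(four_gpu_lr)    # 4x the base rate for an effective batch of 128
assert math.isclose(four_gpu_lr, 4e-3)
```

Linear scaling is a heuristic. If the loss diverges early at the scaled rate, a lower rate or a gradual ramp-up may be needed.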

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Run torchrun --nproc_per_node=2 on a minimal training script that prints GPU rank. Verify all ranks execute and print their rank correctly.
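One possible starting point for this exercise is sketched below. It only reads the RANK, LOCAL_RANK, and WORLD_SIZE environment variables that torchrun sets for each process, so it needs no GPU. The file name `print_rank.py`, the function `describe_rank`, and the single-process fallback defaults are assumptions made for illustration.

```python
import os


def describe_rank(env=os.environ) -> str:
    """Format the rank information torchrun exposes through environment variables."""
    # Fallback defaults (an assumption) let the script also run without torchrun.
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return f"rank {rank} of {world_size} (local rank {local_rank})"


if __name__ == "__main__":
    print(describe_rank())
```

Saved as print_rank.py and launched with torchrun --nproc_per_node=2 print_rank.py, each of the two processes should print one line, with ranks 0 and 1 both appearing; the order of the lines is not guaranteed.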
