08. Model Parallelism

Chapter 8 of 18 · 20 min

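When a model doesn't fit in a single GPU's memory, model parallelism splits the model itself across GPUs. This requires changes to the model code (each layer must be placed on a device and activations moved between devices) and introduces pipeline bubbles: periods when one GPU sits idle waiting for another.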
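To judge whether a model fits on one GPU, a rough estimate of training memory is often enough. The sketch below is a back-of-the-envelope calculation, not a measurement. The function name `training_memory_gb` is hypothetical; it assumes fp32 values (4 bytes each) and an Adam-style optimizer that keeps two extra values per parameter, and it ignores activations and framework overhead, which can be substantial.

```python
def training_memory_gb(num_params, bytes_per_value=4, optimizer_states=2):
    """Rough lower bound on training memory in GB (10**9 bytes).

    Counts weights, gradients, and optimizer states only.
    Activations and framework overhead are NOT included.
    """
    # One weight + one gradient + the optimizer's per-parameter states
    values_per_param = 1 + 1 + optimizer_states
    return num_params * values_per_param * bytes_per_value / 1e9


if __name__ == "__main__":
    # Hypothetical 1.5-billion-parameter model, fp32, Adam-style optimizer
    print(f"{training_memory_gb(1_500_000_000):.1f} GB")
```

Under these assumptions a 1.5-billion-parameter model needs about 24 GB before any activations are stored, which already exceeds a 16 GB card.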

Horizontal Splitting

Split layers across GPUs by assigning different layers to different devices:

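import torch.nn as nn

class ModelParallelResNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Stem, layer1, and layer2 on GPU 0
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, 2, 3),
            nn.BatchNorm2d(64),
            nn.ReLU()
        ).cuda(0)
        self.layer1 = self._make_layer(64, 64, 3).cuda(0)
        self.layer2 = self._make_layer(64, 128, 4).cuda(0)

        # layer3, layer4, and the head on GPU 1
        self.layer3 = self._make_layer(128, 256, 6).cuda(1)
        self.layer4 = self._make_layer(256, 512, 3).cuda(1)
        self.head = nn.Linear(512, num_classes).cuda(1)

    @staticmethod
    def _make_layer(in_ch, out_ch, num_blocks):
        # Placeholder stage of plain conv blocks; substitute real residual blocks
        blocks = []
        for i in range(num_blocks):
            blocks += [
                nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(),
            ]
        return nn.Sequential(*blocks)

    def forward(self, x):
        x = self.stem(x.cuda(0))  # Make sure the input starts on GPU 0
        x = self.layer1(x)
        x = self.layer2(x)
        x = x.cuda(1)  # Transfer activations to the second GPU
        x = self.layer3(x)
        x = self.layer4(x)
        x = x.mean(dim=(2, 3))  # Global average pool: (N, 512, H, W) -> (N, 512)
        x = self.head(x)
        return x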
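Where the model is cut matters: if one GPU holds most of the work, the other mostly waits. The following is a minimal sketch of choosing a balanced two-way cut, assuming a per-block cost estimate is available (for example, measured forward time or parameter count). The helper name `best_split` and the cost numbers are hypothetical and for illustration only.

```python
def best_split(costs):
    """Find the two-way contiguous split of `costs` with the lightest heavy side.

    Returns (index, worst_load): blocks [0:index] go to GPU 0, the rest to
    GPU 1, and worst_load is the total cost on the busier GPU.
    """
    if len(costs) < 2:
        raise ValueError("need at least two blocks to split")
    total = sum(costs)
    best_index, best_load = None, float("inf")
    left = 0
    for i in range(1, len(costs)):
        left += costs[i - 1]            # cost of blocks [0:i]
        load = max(left, total - left)  # the busier of the two GPUs
        if load < best_load:
            best_index, best_load = i, load
    return best_index, best_load


if __name__ == "__main__":
    # Hypothetical relative costs, not measurements
    names = ["stem", "layer1", "layer2", "layer3", "layer4", "head"]
    costs = [1, 3, 4, 6, 3, 1]
    index, load = best_split(costs)
    print("GPU 0:", names[:index])
    print("GPU 1:", names[index:])
```

With these made-up costs the cut falls between layer2 and layer3, the same placement used in the class above. With real measurements the best cut may differ.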

The Transfer Bottleneck

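Copying activations between GPUs stalls the pipeline when the copy runs on the same stream as the compute kernels. Issuing the copy on a separate torch.cuda.Stream lets it overlap with independent work, such as the next micro-batch on GPU 0. Within a single batch the copy still has to wait for GPU 0's output, so the streams must be synchronized explicitly:

import torch
import torch.nn as nn

class AsyncModelParallel(nn.Module):
    def __init__(self, gpu0_layers, gpu1_layers):
        super().__init__()
        self.gpu0_layers = gpu0_layers.cuda(0)
        self.gpu1_layers = gpu1_layers.cuda(1)
        # Side stream on GPU 0 reserved for device-to-device copies
        self.stream_transfer = torch.cuda.Stream(device=0)

    def forward(self, x):
        # Compute on GPU 0 (current stream of device 0)
        h = self.gpu0_layers(x.cuda(0))

        # Transfer to GPU 1 on the side stream, once h has been produced
        self.stream_transfer.wait_stream(torch.cuda.current_stream(0))
        with torch.cuda.stream(self.stream_transfer):
            h_gpu1 = h.to("cuda:1", non_blocking=True)
        h.record_stream(self.stream_transfer)  # Keep h's memory alive until the copy ends

        # Compute on GPU 1 only after the copy has finished
        torch.cuda.current_stream(1).wait_stream(self.stream_transfer)
        out = self.gpu1_layers(h_gpu1)
        return out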
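To see what overlap can buy, it helps to model the timing before measuring it. The sketch below is a simplified cost model, not a benchmark: it assumes every micro-batch takes the same compute time on GPU 0 and the same copy time, that one copy can run concurrently with one compute kernel, and it leaves GPU 1's compute out entirely. The function names are hypothetical.

```python
def serial_time(num_batches, compute_ms, transfer_ms):
    """Compute and copy share one stream: every batch pays both costs in turn."""
    return num_batches * (compute_ms + transfer_ms)


def overlapped_time(num_batches, compute_ms, transfer_ms):
    """The copy of batch i runs on a side stream while batch i+1 computes.

    This is a two-step pipeline: the first batch pays both costs, and each
    further batch adds only the slower of the two steps.
    """
    if num_batches == 0:
        return 0
    slower = max(compute_ms, transfer_ms)
    return compute_ms + transfer_ms + (num_batches - 1) * slower


if __name__ == "__main__":
    # Hypothetical numbers: 8 micro-batches, 10 ms compute, 4 ms copy
    print("serial:    ", serial_time(8, 10, 4), "ms")
    print("overlapped:", overlapped_time(8, 10, 4), "ms")
```

In this model the copy cost disappears almost completely when compute is the slower step, and a single batch gains nothing, which is why streams pay off only in combination with micro-batching.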

Pipeline Parallelism

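GPipe and PipeDream split the model into stages and feed multiple micro-batches through them, so that one stage can start on the next micro-batch while the following stage is still processing the previous one. This keeps more GPUs busy and hides transfer latency: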

# Simplified micro-batch pipeline (forward pass only)
def pipeline_step(model, micro_batches):
    outputs = []
    for mb in micro_batches:
        # CUDA kernels launch asynchronously, so the host can enqueue the
        # next micro-batch on GPU 0 while GPU 1 is still working on the
        # previous one. Avoid host syncs (.item(), .cpu(), print) in this loop.
        outputs.append(model(mb))
    return outputs
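The schedule behind this idea can be sketched without any GPUs. The code below is a toy simulation, assuming every stage takes one equal time step per micro-batch and modeling only the forward pass; `forward_schedule` and `bubble_fraction` are hypothetical helper names. Under that assumption, p stages and m micro-batches need m + p - 1 steps, and each stage idles for p - 1 of them.

```python
def forward_schedule(num_stages, num_micro_batches):
    """Return a grid where grid[t][s] is the micro-batch index that stage s
    processes at time step t, or None if the stage is idle (a bubble)."""
    num_steps = num_micro_batches + num_stages - 1
    grid = []
    for t in range(num_steps):
        row = []
        for s in range(num_stages):
            mb = t - s  # micro-batch mb reaches stage s at step mb + s
            row.append(mb if 0 <= mb < num_micro_batches else None)
        grid.append(row)
    return grid


def bubble_fraction(num_stages, num_micro_batches):
    """Fraction of (step, stage) slots in which a stage is idle."""
    grid = forward_schedule(num_stages, num_micro_batches)
    slots = [cell for row in grid for cell in row]
    return slots.count(None) / len(slots)


if __name__ == "__main__":
    for t, row in enumerate(forward_schedule(2, 4)):
        cells = ["idle" if mb is None else f"mb{mb}" for mb in row]
        print(f"step {t}: GPU0={cells[0]:<4} GPU1={cells[1]}")
    print(f"bubble fraction: {bubble_fraction(2, 4):.2f}")
```

With 2 stages and 4 micro-batches, 2 of the 10 slots are idle. Adding micro-batches shrinks that share, at the cost of smaller per-step batches.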

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Profile a model-parallel forward pass with torch.profiler, calling torch.cuda.synchronize() before and after the timed region so that asynchronous kernels are fully counted. Identify the time spent on GPU-to-GPU transfers.