08. Model Parallelism
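When a model doesn't fit in a single GPU's memory, model parallelism splits the model itself across GPUs. This requires changes to the model code (device placement and explicit activation transfers), and a naive split leaves each GPU idle while the other computes. Pipeline parallelism reduces that idle time but still leaves pipeline bubbles.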
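As a rough check of whether a model fits, the arithmetic below estimates training memory for weights, gradients, and optimizer state. It is a back-of-the-envelope sketch, assuming fp32 training with Adam (two extra fp32 values per parameter) and ignoring activations. The function name, the 3-billion-parameter model, and the 24 GiB budget are illustrative assumptions, not taken from any library or from this chapter.

```python
def training_memory_gib(num_params, bytes_per_value=4, optimizer_states=2):
    """Estimate memory for weights + gradients + optimizer state, in GiB.

    Assumes every value is stored at the same precision and ignores
    activations, which often dominate at large batch sizes.
    """
    copies = 1 + 1 + optimizer_states  # weights, gradients, optimizer state
    return num_params * bytes_per_value * copies / 2**30


# Hypothetical 3-billion-parameter model against a hypothetical 24 GiB GPU
needed = training_memory_gib(3_000_000_000)
fits = needed <= 24
print(f"{needed:.1f} GiB needed, fits on one GPU: {fits}")
```

If the estimate exceeds the budget before activations are even counted, the model is a candidate for the techniques in this chapter.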
Horizontal Splitting
Assign contiguous groups of layers to different devices, and move the activations from one device to the next inside forward():
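import torch
import torch.nn as nn

class ModelParallelResNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Stem and the first two stages live on GPU 0
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, 2, 3),
            nn.BatchNorm2d(64),
            nn.ReLU(),
        ).to("cuda:0")
        self.layer1 = self._make_layer(64, 64, 3).to("cuda:0")
        self.layer2 = self._make_layer(64, 128, 4).to("cuda:0")
        # The last two stages and the classifier head live on GPU 1
        self.layer3 = self._make_layer(128, 256, 6).to("cuda:1")
        self.layer4 = self._make_layer(256, 512, 3).to("cuda:1")
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(512, num_classes).to("cuda:1")

    @staticmethod
    def _make_layer(in_ch, out_ch, num_blocks):
        # Simplified stand-in for a ResNet stage: plain conv blocks, no residuals
        blocks = []
        for i in range(num_blocks):
            blocks += [
                nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(),
            ]
        return nn.Sequential(*blocks)

    def forward(self, x):
        x = self.stem(x.to("cuda:0"))
        x = self.layer1(x)
        x = self.layer2(x)
        x = x.to("cuda:1")  # Copy activations to the second GPU
        x = self.layer3(x)
        x = self.layer4(x)
        x = torch.flatten(self.pool(x), 1)  # (N, 512, H, W) -> (N, 512)
        return self.head(x)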
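The split point above is fixed by hand. One way to choose it is to balance a per-layer cost across the two devices. A minimal sketch, assuming you already have some cost per layer, such as parameter counts or measured forward times; `best_split` and the cost numbers are hypothetical and only illustrate the idea:

```python
def best_split(costs):
    """Return (index, worst_stage_cost) for a two-way contiguous split.

    Layers [0, index) go to the first device, [index, len) to the second.
    The split minimizes the cost of the more heavily loaded device.
    """
    if len(costs) < 2:
        raise ValueError("need at least two layers to split")
    total = sum(costs)
    best_index, best_worst = None, float("inf")
    prefix = 0
    for i in range(1, len(costs)):
        prefix += costs[i - 1]
        worst = max(prefix, total - prefix)
        if worst < best_worst:
            best_index, best_worst = i, worst
    return best_index, best_worst


# Hypothetical relative costs for stem, layer1, layer2, layer3, layer4, head
costs = [1, 3, 4, 6, 3, 1]
index, worst = best_split(costs)
print(index, worst)
```

With these made-up costs the split falls after the third entry, which corresponds to the placement used in the class above. Real costs should come from measurement, since memory and compute rarely balance at the same point.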
The Transfer Bottleneck
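Device-to-device copies add latency to the critical path, and while one GPU computes, the other sits idle. torch.cuda.Stream lets a copy run concurrently with kernels, but only for independent work: within a single batch, the copy must wait for GPU 0's output and GPU 1 must wait for the copy. The version below issues the copy on a dedicated stream with explicit synchronization. The overlap only materializes once the batch is split into chunks, as in pipeline parallelism:
class AsyncModelParallel(nn.Module):
    def __init__(self, gpu0_layers, gpu1_layers):
        super().__init__()
        self.gpu0_layers = gpu0_layers.to("cuda:0")
        self.gpu1_layers = gpu1_layers.to("cuda:1")
        # Side stream on the source device, dedicated to the outgoing copy
        self.stream_transfer = torch.cuda.Stream(device=0)

    def forward(self, x):
        # Compute on GPU 0 (its current stream)
        h = self.gpu0_layers(x.to("cuda:0"))
        # The copy must not start before h has been produced
        self.stream_transfer.wait_stream(torch.cuda.current_stream(0))
        with torch.cuda.stream(self.stream_transfer):
            h_gpu1 = h.to("cuda:1", non_blocking=True)
        # Keep h's memory from being reused while the copy is in flight
        h.record_stream(self.stream_transfer)
        # GPU 1 must not consume h_gpu1 before the copy has finished
        torch.cuda.current_stream(1).wait_stream(self.stream_transfer)
        return self.gpu1_layers(h_gpu1)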
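To judge whether the copy is worth optimizing, it helps to estimate its cost from the activation size and the link bandwidth. A back-of-the-envelope sketch: `transfer_ms` is a hypothetical helper, the tensor shape is illustrative, and the default bandwidth is a placeholder that should be replaced with a value measured on your own hardware.

```python
import math


def transfer_ms(shape, bytes_per_value=4, bandwidth_gb_per_s=10.0):
    """Estimate the time in ms to copy a tensor of `shape` between devices.

    Ignores per-copy launch latency, so it underestimates small transfers.
    """
    num_bytes = math.prod(shape) * bytes_per_value
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# Hypothetical fp32 activations of shape (batch, channels, height, width)
shape = (64, 128, 28, 28)
print(f"{transfer_ms(shape):.2f} ms")
```

Comparing this number with the measured compute time of each stage indicates whether the transfer or the idle GPU is the larger cost.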
Pipeline Parallelism
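GPipe and PipeDream split the model into stages and feed each batch through as a sequence of micro-batches, so that one stage can work on micro-batch i+1 while the next stage works on micro-batch i. This shrinks the idle time (the pipeline bubble) and overlaps transfers with computation: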
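# Simplified GPipe-style forward pipeline (PipeDream additionally interleaves backward passes)
def pipeline_step(model, micro_batches):
    outputs = []
    for mb in micro_batches:
        # CUDA calls are asynchronous: this returns once the work is queued,
        # so GPU 0 can start the next micro-batch while GPU 1 is still
        # processing this one.
        out = model(mb)
        outputs.append(out)
    # Avoid reading results (.item(), .cpu(), print) inside the loop;
    # that forces a host sync and serializes the stages.
    return outputs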
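How much idle time micro-batching removes can be estimated with an idealized schedule. The sketch below assumes a GPipe-style forward pass in which every stage takes one time unit per micro-batch and transfers are free. Under those assumptions, with p stages and m micro-batches, each device idles for (p - 1) of the (m + p - 1) time steps. `bubble_fraction` is a hypothetical helper that counts this by simulating the schedule rather than by applying the formula:

```python
def bubble_fraction(num_stages, num_micro_batches):
    """Fraction of device time slots left idle in an idealized pipeline.

    Stage s processes micro-batch b at time step s + b, so the whole
    forward pass takes num_stages + num_micro_batches - 1 steps.
    """
    total_steps = num_stages + num_micro_batches - 1
    busy = 0
    for step in range(total_steps):
        for stage in range(num_stages):
            micro_batch = step - stage
            if 0 <= micro_batch < num_micro_batches:
                busy += 1
    slots = total_steps * num_stages
    return (slots - busy) / slots


# Two stages, increasing number of micro-batches
for m in (1, 4, 16):
    print(m, round(bubble_fraction(2, m), 3))
```

With one micro-batch and two stages, half of all slots are idle, which is the naive model-parallel case. More micro-batches shrink the bubble, at the cost of smaller per-kernel workloads.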
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
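Profile a model-parallel forward pass with torch.profiler, calling torch.cuda.synchronize() on each device before and after the timed region so that asynchronous kernels are fully accounted for. Identify the time spent on GPU-to-GPU activation transfers, separately from the initial CPU-to-GPU input copy.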
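A possible starting point for the profiling exercise, offered as a sketch and not a complete solution. The helper names `sync_all_devices` and `profile_forward` are hypothetical, and the toy model is a stand-in so that the code also runs on a machine without GPUs (in which case only CPU activity is recorded):

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile


def sync_all_devices():
    """Block until queued work on every visible GPU has finished."""
    for device in range(torch.cuda.device_count()):
        torch.cuda.synchronize(device)


def profile_forward(model, x):
    """Profile one forward pass; returns (profiler, output)."""
    use_cuda = torch.cuda.is_available()
    activities = [ProfilerActivity.CPU]
    if use_cuda:
        activities.append(ProfilerActivity.CUDA)
        sync_all_devices()  # drain earlier work so it is not attributed here
    with profile(activities=activities) as prof:
        with torch.no_grad():
            out = model(x)
        if use_cuda:
            sync_all_devices()  # CUDA calls are async; wait before stopping
    return prof, out


# Stand-in model so the sketch also runs without GPUs
toy = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 8))
prof, out = profile_forward(toy, torch.randn(4, 32))
print(prof.key_averages().table(row_limit=10))
```

To apply this to a two-GPU model, pass that model in place of the toy one. Copies between devices should then appear as memory-copy entries in the table, though the exact entry names depend on the PyTorch and CUDA versions in use.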