02. Cluster Topology

Chapter 2 of 18 · 10 min

A local AI cluster topology defines how machines interconnect for distributed training and serving. These choices affect throughput, latency, and failure handling in ways that differ from traditional HPC or web service clusters.

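LLM workloads have distinct characteristics that inform topology decisions. A large share of communication time goes to collective operations: all-reduce for tensor parallelism and for data-parallel gradient synchronization, and all-gather for sharded parameters. Pipeline parallelism, by contrast, relies mainly on point-to-point transfers between adjacent stages. Unlike many traditional HPC simulations, LLM clusters issue these collectives at high frequency; with tensor parallelism they occur several times per transformer layer in every step. Network bandwidth and latency therefore directly affect training step time and serving latency.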
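To make the bandwidth and latency dependence concrete, the sketch below uses the standard alpha-beta cost model for a ring all-reduce. The ring algorithm, the function name, and the example numbers (8 GPUs, 25 GB/s links, 5 µs per-step latency) are assumptions chosen for illustration, not measurements from any particular cluster.

```python
def ring_allreduce_seconds(payload_bytes, n_gpus, link_bytes_per_s, latency_s):
    """Alpha-beta estimate for one ring all-reduce (illustrative model only).

    A ring all-reduce takes 2 * (N - 1) steps; each step pays one link
    latency and moves payload / N bytes over each GPU's link.
    """
    if n_gpus < 2:
        return 0.0  # nothing to reduce across a single GPU
    steps = 2 * (n_gpus - 1)
    latency_term = steps * latency_s
    bandwidth_term = steps / n_gpus * payload_bytes / link_bytes_per_s
    return latency_term + bandwidth_term


if __name__ == "__main__":
    # Hypothetical link: 25 GB/s, 5 microseconds per step, 8 GPUs.
    for size in (64 * 1024, 256 * 1024 * 1024):
        t = ring_allreduce_seconds(size, 8, 25e9, 5e-6)
        print(f"{size:>12d} bytes -> {t * 1e3:.3f} ms")
```

Under this model, small messages are dominated by the latency term and large messages by the bandwidth term, which is why frequent small collectives are sensitive to hop count.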

Common topologies include fat-tree, dragonfly, and torus. A fat-tree provides predictable bisection bandwidth using commodity switches, but traffic between distant hosts crosses several switch tiers. Dragonfly uses high-radix routers to reduce hop count, which improves collective performance at higher hardware cost. Torus topologies suit regular nearest-neighbor communication, such as ring-based collectives and the stage-to-stage transfers in pipeline parallelism.
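The hop-count trade-off can be sketched with two textbook formulas: the diameter of a torus with wraparound links, and the host capacity of a three-tier k-ary fat-tree built from k-port switches. The function names are hypothetical, and the two hop figures use different units (node-to-node links for the torus, switches traversed for the fat-tree), so treat them as rough indicators rather than a direct comparison.

```python
def torus_diameter(dims):
    """Worst-case link hops between two nodes in a torus with wraparound.

    Along a ring of d nodes the farthest node is d // 2 hops away, and
    the dimensions are independent, so the per-dimension maxima add up.
    """
    return sum(d // 2 for d in dims)


def fat_tree_hosts(k):
    """Hosts supported by a three-tier k-ary fat-tree of k-port switches."""
    if k < 2 or k % 2:
        raise ValueError("k must be an even number of ports, at least 2")
    return k ** 3 // 4


# Worst-case path in a three-tier fat-tree crosses five switches:
# edge, aggregation, core, aggregation, edge.
FAT_TREE_MAX_SWITCH_HOPS = 5

if __name__ == "__main__":
    print("4x4x4 torus diameter:", torus_diameter((4, 4, 4)))
    print("hosts with 48-port switches:", fat_tree_hosts(48))
```

The torus diameter grows with the size of each dimension, while the fat-tree's worst case stays fixed as hosts are added within a given tier count.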

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

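Map your intended cluster scale to collective operation frequency. For gradient all-reduce across N GPUs, estimate the required bandwidth as (gradient size in bytes × gradient compression factor) ÷ target collective time. Tensor-parallel all-reduce moves activations rather than gradients, so substitute the per-layer activation size in that case. Cross-reference the result against the per-port and aggregate capacity of your network switches.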
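One way to work the exercise is sketched below. It extends the estimate with the ring all-reduce traffic factor 2(N − 1)/N, which is an assumption about the collective algorithm and not part of the exercise statement. The function names, the 14 GB payload, and the 400 Gbit/s port speed are hypothetical placeholders; substitute your own figures.

```python
def required_gbit_per_s(payload_bytes, compression_factor, n_gpus, target_seconds):
    """Per-GPU link bandwidth needed to finish one all-reduce in target_seconds.

    Assumes a ring all-reduce, where each GPU sends about
    2 * (N - 1) / N times the (compressed) payload.
    """
    if n_gpus < 2:
        return 0.0  # a single GPU needs no network traffic
    wire_bytes = 2 * (n_gpus - 1) / n_gpus * payload_bytes * compression_factor
    return wire_bytes * 8 / target_seconds / 1e9


def fits_port(required_gbit, port_gbit, usable_fraction=0.8):
    """Check the requirement against a port, leaving headroom (assumed 20%)."""
    return required_gbit <= port_gbit * usable_fraction


if __name__ == "__main__":
    # Hypothetical: 14 GB payload, no compression, 8 GPUs, 1 s budget.
    need = required_gbit_per_s(14e9, 1.0, 8, 1.0)
    print(f"required: {need:.1f} Gbit/s per GPU")
    print("fits a 400 Gbit/s port:", fits_port(need, 400))
```

Record the inputs you used beside the result, since the answer changes with N, the compression factor, and the time budget.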