01. Why Cluster?

Chapter 1 of 18 · 10 min

Training and serving large language models often requires more memory and compute than a single GPU, and sometimes a single machine, can provide. A model with 70 billion parameters in half precision (2 bytes per parameter) needs 140GB just to store the weights. The KV cache and activation memory during inference push this requirement higher. An 80GB datacenter GPU such as the A100 cannot hold such a model in memory.
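The weight arithmetic above can be checked with a short sketch. The function names and the `usable_fraction` headroom value are illustrative assumptions, not measurements; the estimate covers weights only and ignores KV cache and activations.

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(
    num_params: float,
    gpu_memory_gb: float,
    bytes_per_param: float = 2.0,
    usable_fraction: float = 0.9,  # assumed headroom; tune for your runtime
) -> int:
    """Lower bound on GPU count needed just to hold the weights."""
    usable_gb = gpu_memory_gb * usable_fraction
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / usable_gb)


# 70B parameters at half precision (2 bytes each)
print(weight_memory_gb(70e9))          # weights only, in GB
print(min_gpus_for_weights(70e9, 80))  # 80GB cards, weights only
```

Treat the result as a floor: real deployments need additional memory for the KV cache, activations, and framework overhead.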

Cluster computing addresses this fundamental constraint. By distributing model parameters, activations, and computation across multiple GPUs and machines, you can serve models that would otherwise not fit. This matters in practice: at half precision, the weights of a 70B model alone exceed one 80GB GPU, so serving it means splitting the model across several devices.

The alternative is to shrink the model. Quantized models or smaller fine-tuned derivatives can run on fewer devices, often with acceptable quality for specific tasks. For general-purpose deployments where you cannot control input length or quality, larger models remain necessary.

Beyond memory, throughput drives cluster decisions. A single A100 generates the tokens of each sequence one at a time, and batching concurrent requests raises throughput only until compute or memory bandwidth saturates. Production inference workloads with many concurrent requests reach that ceiling quickly. Scaling horizontally across a cluster adds tokens per second once a single machine plateaus.
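As a rough sketch of the horizontal-scaling arithmetic, the helper below estimates how many model replicas a target throughput implies. The function name, the `efficiency` factor, and the example figures are hypothetical; substitute throughput you have measured on your own hardware.

```python
import math


def replicas_needed(
    target_tokens_per_s: float,
    replica_tokens_per_s: float,
    efficiency: float = 1.0,  # assumed fraction of ideal scaling actually achieved
) -> int:
    """Number of identical replicas needed to reach a target throughput."""
    if replica_tokens_per_s <= 0:
        raise ValueError("replica_tokens_per_s must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    return math.ceil(target_tokens_per_s / (replica_tokens_per_s * efficiency))


# Hypothetical numbers: each replica sustains 1,500 tokens/s when saturated.
replicas = replicas_needed(10_000, 1_500)
gpus_per_replica = 2  # e.g. a model split across two GPUs
print(replicas, replicas * gpus_per_replica)
```

The estimate assumes replicas are independent and load is spread evenly; lowering `efficiency` is a crude way to account for routing and scheduling losses.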

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Profile your target model's memory requirements. Load the model, run inference at batch size 1, and use nvidia-smi to record the memory used on each GPU. Then use token generation benchmarks to calculate how many GPUs your target throughput requires. Document the result as your baseline architecture.
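One way to capture the nvidia-smi reading for this exercise is sketched below. It relies on nvidia-smi's `--query-gpu` and `--format=csv,noheader,nounits` options, which report memory in MiB; the helper names are illustrative, and `read_gpu_memory` only works on a machine with an NVIDIA driver installed.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[dict]:
    """Parse 'index, used, total' lines (MiB) into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (field.strip() for field in line.split(","))
        rows.append(
            {"index": int(index), "used_mib": int(used), "total_mib": int(total)}
        )
    return rows


def read_gpu_memory() -> list[dict]:
    """Run nvidia-smi once and return current per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Made-up sample text in the same shape as the command's output, not a measurement.
sample = "0, 1000, 81920\n1, 2000, 81920\n"
print(parse_gpu_memory(sample))
```

Call `read_gpu_memory()` while the model is serving a request, since a single reading is a snapshot and idle usage understates the inference-time footprint.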