04. Shared Storage

Chapter 4 of 18 · 10 min

Shared storage in an AI cluster keeps model artifacts, checkpoints, and training datasets accessible from every compute node. The storage architecture directly affects job scheduling flexibility, failure recovery time, and overall cluster utilization.

AI workloads have distinctive storage access patterns. Training jobs read the entire dataset, largely sequentially, and write checkpoints only occasionally. Serving workloads read model weights once at load time but need those weights to stay accessible for the lifetime of the job. Inference serving also generates logs and, in some cases, output datasets, and its write patterns differ from those of batch training.

Common storage architectures include NFS for simplicity and Ceph for scalable durability. NFS works for single-rack clusters with moderate throughput requirements but breaks down under parallel training workloads where hundreds of processes read simultaneously. Ceph, through its CephFS filesystem layer, provides POSIX-compatible distributed storage with better parallelism, but it requires operational expertise and more infrastructure.
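One way to see whether a mount holds up under concurrent readers is to time the same set of files at several reader counts. The sketch below is a minimal illustration, not a benchmark tool: the function names are hypothetical, and it reads temporary files rather than a real NFS or CephFS mount. On a real cluster, point it at files on the shared mount and keep in mind that the operating system's page cache can inflate repeated reads.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def read_file(path, block_size=1 << 20):
    """Read a file sequentially in fixed-size blocks; return bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def measure_parallel_read(paths, workers):
    """Read all paths using `workers` threads; return (total_bytes, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total = sum(pool.map(read_file, paths))
    return total, time.perf_counter() - start


if __name__ == "__main__":
    # Assumption: a temp directory stands in for the shared mount.
    with tempfile.TemporaryDirectory() as root:
        paths = []
        for i in range(8):
            path = os.path.join(root, f"shard_{i}.bin")
            with open(path, "wb") as f:
                f.write(os.urandom(4 << 20))  # 4 MiB of dummy data per shard
            paths.append(path)
        for workers in (1, 4, 8):
            total, secs = measure_parallel_read(paths, workers)
            print(f"{workers} readers: {total / max(secs, 1e-9) / 1e6:.0f} MB/s")
```

If aggregate throughput stops growing, or drops, as the reader count rises, the storage backend is likely the bottleneck for parallel training.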

Practical recommendation: separate model storage from training data storage. Model weights are large and read-mostly, so they benefit from local SSD caching backed by shared read access. Training data is read-heavy but usually consists of many smaller files, which suits a distributed filesystem.
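The local SSD caching idea can be sketched as a copy-on-first-use helper. This is a minimal illustration under stated assumptions: the function name and directory layout are hypothetical, and a file-size match is treated as proof that the cached copy is current, where a checksum would be stricter.

```python
import os
import shutil
import tempfile


def cached_model_path(shared_path, cache_dir):
    """Return a local copy of shared_path, copying it on first use."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, os.path.basename(shared_path))
    # Assumption: equal file size means the cached copy is current.
    if (os.path.exists(local_path)
            and os.path.getsize(local_path) == os.path.getsize(shared_path)):
        return local_path
    # Copy to a temp file in the same directory, then rename atomically so
    # other processes on this node never see a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".partial")
    os.close(fd)
    try:
        shutil.copyfile(shared_path, tmp_path)
        os.replace(tmp_path, local_path)
    finally:
        # Only true if the copy failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return local_path


if __name__ == "__main__":
    # Assumption: temp directories stand in for the shared mount and local SSD.
    with tempfile.TemporaryDirectory() as root:
        shared_file = os.path.join(root, "weights.bin")
        with open(shared_file, "wb") as f:
            f.write(os.urandom(1 << 20))
        local = cached_model_path(shared_file, os.path.join(root, "ssd_cache"))
        print("serving from", local)
```

The first job on a node pays the cost of the copy; later jobs on the same node read from local disk and leave the shared filesystem free for training data.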

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Profile your storage access patterns by tracing the system calls your training script makes. Use strace -f -e trace=read,write during the initial training epochs to identify how often files are accessed and in what sizes. Then design a storage architecture that matches these patterns rather than assuming a general-purpose design will fit.
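Raw strace output is long, so a short script helps turn it into call counts and size distributions. The sketch below assumes the trace was saved to a file with strace's -o option (for example, strace -f -e trace=read,write -o trace.log python train.py, where train.py and trace.log are hypothetical names). It only counts calls that appear complete on one line, so calls that strace splits into "unfinished" and "resumed" halves are skipped, as are calls that returned an error.

```python
import re
from collections import Counter, defaultdict

# Matches lines such as:
#   [pid 4021] read(7, "..."..., 1048576) = 1048576
#   4021 write(9, "..."..., 4096) = 4096
LINE_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(read|write)\(.*\)\s+=\s+(-?\d+)"
)


def summarize_strace(lines):
    """Return per-syscall call counts, total bytes, and a size histogram."""
    stats = defaultdict(lambda: {"calls": 0, "bytes": 0, "sizes": Counter()})
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # unfinished/resumed fragments and unrelated lines
        name, returned = match.group(1), int(match.group(2))
        if returned < 0:
            continue  # failed call, e.g. -1 EAGAIN
        entry = stats[name]
        entry["calls"] += 1
        entry["bytes"] += returned
        entry["sizes"][returned] += 1
    return dict(stats)


if __name__ == "__main__":
    # Hypothetical sample lines; replace with open("trace.log") for a real trace.
    sample = [
        '[pid 4021] read(7, "\\211HDF"..., 1048576) = 1048576',
        '[pid 4021] read(7, ""..., 1048576) = 524288',
        '[pid 4022] write(9, "ckpt"..., 4096) = 4096',
        "[pid 4022] read(5, 0x7ffd1c, 4096) = -1 EAGAIN (Resource temporarily unavailable)",
        "[pid 4021] read(7,  <unfinished ...>",
    ]
    for name, entry in summarize_strace(sample).items():
        print(name, entry["calls"], "calls,", entry["bytes"], "bytes,",
              "most common sizes:", entry["sizes"].most_common(3))
```

Many small reads point toward a dataset of small files or unbuffered access. A few very large writes are typical of checkpointing.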