11. Slurm for AI
Workload managers like Slurm provide batch scheduling, job queuing, and fine-grained resource allocation that the default Kubernetes scheduler does not offer out of the box, which is why they are common on GPU clusters running AI training workloads.
Installing Slurm on Ubuntu
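A Slurm installation consists of a controller daemon (slurmctld, which also runs the scheduler), a compute daemon (slurmd) on every node, and, when job accounting is wanted, the accounting daemon (slurmdbd) backed by a MySQL or MariaDB database:
apt-get install -y slurmd slurmctld slurmdbd mariadb-server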
# Configure MariaDB for Slurm accounting
mysql -e "CREATE DATABASE slurm_acct_db;"
mysql -e "CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'slurm_password';"
mysql -e "GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';"
mysql -e "FLUSH PRIVILEGES;"
The slurm.conf file controls cluster behavior. A minimal configuration for a two-node test cluster:
ClusterName=localai
SlurmctldHost=localhost
SlurmUser=root
SlurmctldPort=6817
SlurmdPort=6818
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
TmpFS=/tmp
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
MpiDefault=none
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
GresTypes=gpu,tmpdisk
NodeName=localhost[1-2] CPUs=32 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=gpu:2 State=UNKNOWN
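The bracket range localhost[1-2] expands to two nodes, localhost1 and localhost2. Each node runs its own slurmd daemon, so the names must resolve to real hosts (running several slurmd instances on one machine needs a Slurm build with multiple-slurmd support). The Gres=gpu:2 entry normally also needs a matching gres.conf on each node that maps the GPUs to device files.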
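The CPU count on a node line should agree with its topology fields, since Slurm compares the definition with what slurmd reports. The sketch below multiplies Sockets, CoresPerSocket, and ThreadsPerCore for a sample line; the helper names get_field and expected_cpus are illustrative, not Slurm tools, and the node line is a sample rather than something read from a live cluster.

```shell
#!/bin/bash
# Sample node definition (illustrative; adjust to the real hardware).
node_line='NodeName=localhost[1-2] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2'

# get_field KEY LINE: print the value of KEY=value from a slurm.conf line.
get_field() {
    local key=$1 token
    local -a tokens
    # read -ra splits on whitespace without glob-expanding "[1-2]".
    read -ra tokens <<< "$2"
    for token in "${tokens[@]}"; do
        case $token in
            "$key"=*) printf '%s\n' "${token#*=}"; return 0 ;;
        esac
    done
    return 1
}

# expected_cpus LINE: print Sockets * CoresPerSocket * ThreadsPerCore.
expected_cpus() {
    local s c t
    s=$(get_field Sockets "$1") || return 1
    c=$(get_field CoresPerSocket "$1") || return 1
    t=$(get_field ThreadsPerCore "$1") || return 1
    echo $(( s * c * t ))
}

echo "declared CPUs: $(get_field CPUs "$node_line")"
echo "topology CPUs: $(expected_cpus "$node_line")"
```

If the two numbers differ, correcting the CPUs value (or dropping it and letting Slurm derive it from the topology) is usually the simpler fix.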
Starting and Verifying Services
systemctl enable slurmctld slurmd
systemctl start slurmctld
systemctl start slurmd
# Check cluster state
sinfo
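# If a node is listed as down, return it to service once the cause is fixed
scontrol update NodeName=localhost1 State=RESUME
State=UNKNOWN tells the controller to determine each node's state when its slurmd registers, so healthy nodes normally come up idle. A node can show as down when its slurmd has not registered, when the hardware it reports does not match slurm.conf, or after an unexpected reboot; in those cases it stays down until it is resumed.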
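Resuming nodes by hand gets tedious on more than a couple of machines. As a rough sketch, the filter below picks out node names whose state starts with "down" from "name state" pairs; down_nodes is a hypothetical helper, and the commented pipeline assumes sinfo and scontrol are on the PATH.

```shell
#!/bin/bash
# down_nodes: read "name state" pairs on stdin and print the names
# whose state begins with "down" (this also matches "down*").
down_nodes() {
    local name state
    while read -r name state; do
        case $state in
            down*) printf '%s\n' "$name" ;;
        esac
    done
}

# On a live cluster (not run here):
#   sinfo -h -N -o '%N %T' | down_nodes | sort -u | while read -r node; do
#       scontrol update NodeName="$node" State=RESUME
#   done
```

Running sinfo -R first is worthwhile, since it lists the recorded reason for each down node and a blind resume can hide a real fault.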
Submitting GPU Jobs
Slurm uses gres (generic resource) specifications for GPU allocation:
#!/bin/bash
#SBATCH --job-name=llama-finetune
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=60G
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --output=%j.out
module load cuda/12.1
srun python train.py --config configs/llama3-8b.yaml
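Submit with sbatch script.sh and monitor with squeue. A job whose GPU request cannot be satisfied yet stays pending, and the REASON column of squeue shows why (for example Resources or ReqNodeNotAvail). A request that no configured node could ever satisfy is typically rejected by sbatch at submission.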
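A common reason for a request that can never be satisfied is a unit mismatch on memory: RealMemory in slurm.conf is given in megabytes, while a G suffix on --mem means multiples of 1024 MB. The sketch below converts a --mem value to megabytes and compares it with a node's RealMemory; mem_to_mb and mem_fits are illustrative helpers, not part of Slurm.

```shell
#!/bin/bash
# mem_to_mb SIZE: convert a --mem style value to megabytes.
# Accepts a plain number (MB) or a K, M, G, or T suffix.
mem_to_mb() {
    local value=$1 number suffix
    number=${value%[KkMmGgTt]}      # strip one trailing unit letter, if any
    suffix=${value#"$number"}       # whatever was stripped
    case $suffix in
        ''|M|m) echo "$number" ;;
        K|k)    echo $(( number / 1024 )) ;;
        G|g)    echo $(( number * 1024 )) ;;
        T|t)    echo $(( number * 1024 * 1024 )) ;;
    esac
}

# mem_fits SIZE REAL_MEMORY_MB: succeed if the request fits on the node.
mem_fits() {
    [ "$(mem_to_mb "$1")" -le "$2" ]
}

for request in 64G 60G; do
    if mem_fits "$request" 64000; then
        echo "$request ($(mem_to_mb "$request") MB) fits RealMemory=64000"
    else
        echo "$request ($(mem_to_mb "$request") MB) exceeds RealMemory=64000"
    fi
done
```

With these numbers, 64G works out to 65536 MB and does not fit a node declared with RealMemory=64000, while 60G (61440 MB) does.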
Common Slurm Failures
Jobs failing to start typically trace to misconfigured partition definitions or resource conflicts:
# Diagnose job failure
scontrol show job $JOB_ID
# Typical fix: partition not defined
cat <<EOF >> /etc/slurm/slurm.conf
PartitionName=compute Nodes=localhost[1-2] Default=YES MaxTime=24:00:00 State=UP
EOF
systemctl restart slurmctld
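The output of scontrol show job is a long list of Key=Value pairs, and the two that matter most for a job that will not start are JobState and Reason. A small sketch for pulling one field out of that output follows; job_field is a hypothetical helper and the sample line is illustrative rather than captured from a real job.

```shell
#!/bin/bash
# job_field KEY: read `scontrol show job` output on stdin and print
# the value of the first KEY=value pair found.
job_field() {
    tr -s ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Illustrative sample of one line from `scontrol show job`.
sample='   JobState=PENDING Reason=Resources Dependency=(null)'

printf '%s\n' "$sample" | job_field JobState
printf '%s\n' "$sample" | job_field Reason

# On a live cluster (not run here):
#   scontrol show job "$JOB_ID" | job_field Reason
```

For a quick look without parsing, squeue -j "$JOB_ID" -o '%T %r' prints the state and reason directly.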
Accounting database connection failures show up as errors in the slurmdbd and slurmctld logs rather than in job output. Verify that the credentials in slurmdbd.conf (StorageUser and StoragePass) match the database grants.
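For reference, a minimal slurmdbd.conf that lines up with the database and user created earlier might look like the sketch below. It writes to a temporary directory so it can be inspected safely; the real file normally lives next to slurm.conf, and the SlurmUser value and the conf_value helper are assumptions for illustration.

```shell
#!/bin/bash
# Write a sample slurmdbd.conf into a scratch directory (not /etc/slurm).
conf_dir=$(mktemp -d)
cat > "$conf_dir/slurmdbd.conf" <<'EOF'
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm
StoragePass=slurm_password
StorageLoc=slurm_acct_db
EOF
# The file holds a password, so keep it readable by its owner only.
chmod 600 "$conf_dir/slurmdbd.conf"

# conf_value KEY FILE: print the value of KEY from a Key=Value file.
conf_value() {
    sed -n "s/^$1=//p" "$2" | head -n 1
}

echo "database user: $(conf_value StorageUser "$conf_dir/slurmdbd.conf")"
echo "database name: $(conf_value StorageLoc "$conf_dir/slurmdbd.conf")"
```

StorageUser, StoragePass, and StorageLoc here should match the CREATE USER, IDENTIFIED BY, and CREATE DATABASE values used in the MariaDB setup, and slurm.conf then points at the daemon with AccountingStorageType=accounting_storage/slurmdbd.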
Install Slurm in a single-node configuration, define a two-node cluster with different GPU counts, submit a batch job requesting a specific GPU, and verify the allocation with scontrol show job.