Slurm for AI — Local AI Clusters (Chapter 11)

Workload managers like Slurm provide batch scheduling, job queuing, and fair-share resource allocation that Kubernetes does not offer out of the box, which is why they remain a common choice for GPU clusters running AI training workloads.

Installing Slurm on Ubuntu

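Slurm consists of a controller daemon (slurmctld, which also does the scheduling) and a compute daemon (slurmd) on every node. Job accounting is optional; when enabled, it adds the slurmdbd daemon backed by a database such as MariaDB. MUNGE authenticates traffic between the daemons:

apt-get install -y slurmctld slurmd slurmdbd munge mariadb-server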

# Configure MariaDB for Slurm accounting
mysql -e "CREATE DATABASE slurm_acct_db;"
mysql -e "CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'slurm_password';"
mysql -e "GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';"
mysql -e "FLUSH PRIVILEGES;"
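
Creating the database does not by itself connect Slurm to MariaDB; that is the job of slurmdbd, which reads its own configuration file. A minimal sketch of both sides follows. It assumes the configuration directory /etc/slurm, a dedicated slurm system account, and the placeholder credentials created above; adjust all three to your installation.

```conf
# /etc/slurm/slurmdbd.conf (assumed path; keep it owned by SlurmUser with mode 600)
AuthType=auth/munge
DbdHost=localhost
SlurmUser=slurm                # assumption: a dedicated slurm account exists
StorageType=accounting_storage/mysql
StorageHost=localhost
StorageUser=slurm              # must match the CREATE USER statement
StoragePass=slurm_password     # placeholder password from the GRANT example
StorageLoc=slurm_acct_db       # database name created above

# Lines to add to slurm.conf so that slurmctld reports to slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
```

Keeping the database password in slurmdbd.conf rather than slurm.conf means only the accounting daemon needs to read it.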

The slurm.conf file controls cluster behavior. A minimal configuration for a small two-node AI cluster:

ClusterName=localai
SlurmctldHost=localhost
# Dedicated, unprivileged account; it must own StateSaveLocation
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
TmpFS=/tmp
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
MpiDefault=none
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
GresTypes=gpu,tmpdisk
# A node definition must sit on one line; CPUs = Sockets x CoresPerSocket x ThreadsPerCore
NodeName=localhost[1-2] CPUs=32 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 RealMemory=64000 Gres=gpu:2 State=UNKNOWN

The bracket notation localhost[1-2] expands to two nodes, localhost1 and localhost2, that share one definition. Each name must resolve to a host running its own slurmd daemon.
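
Declaring Gres=gpu:2 in slurm.conf only tells the controller how many GPUs to expect; each node also needs a gres.conf that maps the count to actual devices. A minimal sketch, assuming NVIDIA GPUs exposed as /dev/nvidia0 and /dev/nvidia1 and a gres.conf stored next to slurm.conf (verify the device files with ls /dev/nvidia* on each node):

```conf
# /etc/slurm/gres.conf (assumed path)
# Device files are an assumption; they must exist on every listed node.
NodeName=localhost[1-2] Name=gpu File=/dev/nvidia[0-1]
```

If Slurm was built with NVML support, a single AutoDetect=nvml line can replace the explicit File= mapping.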

Starting and Verifying Services

systemctl enable slurmctld slurmd
systemctl start slurmctld
systemctl start slurmd

# Check cluster state
sinfo
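# With State=UNKNOWN, nodes show as idle once their slurmd registers.
# If a node is reported as down instead, return it to service by hand:
scontrol update NodeName=localhost1 State=RESUME

State=UNKNOWN does not hold a node down: the controller sets the real state as soon as slurmd registers. A node stays down when its slurmd cannot reach the controller, when the hardware it reports falls short of its slurm.conf definition, or after an unexpected reboot with the default ReturnToService=0. Only in those cases is the manual resume needed.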
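
On a cluster with more than a couple of nodes, picking the unavailable ones out of sinfo by eye gets tedious. The sketch below filters them. The name list_unavailable_nodes is a hypothetical helper, and it assumes input in the two-column form printed by sinfo -N -h -o "%N %T" (node name, then state).

```shell
# list_unavailable_nodes (hypothetical helper): read "<node> <state>" pairs
# on stdin and print the names of nodes whose state starts with "down" or
# "drain" (this covers down, down*, drained, and draining).
list_unavailable_nodes() {
  local node state
  while read -r node state; do
    case "$state" in
      down*|drain*) printf '%s\n' "$node" ;;
    esac
  done
}

# Typical use on a live cluster (not executed here):
#   sinfo -N -h -o "%N %T" | list_unavailable_nodes
```

Each name it prints is a candidate for scontrol update NodeName=<name> State=RESUME, once the underlying cause has been fixed.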

Submitting GPU Jobs

Slurm allocates GPUs through gres (generic resource) requests:

#!/bin/bash
#SBATCH --job-name=llama-finetune
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
# 64G (65536 MB) would exceed the nodes' RealMemory=64000, so request less
#SBATCH --mem=56G
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --output=%j.out

module load cuda/12.1
srun python train.py --config configs/llama3-8b.yaml

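Submit with sbatch script.sh and monitor with squeue. A job that cannot get its GPUs yet stays PENDING, and squeue shows the reason (for example Resources or ReqNodeNotAvail). A request that no node can ever satisfy is rejected at submission with "Requested node configuration is not available".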
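
Memory is an easy way to make a request unsatisfiable without noticing: --mem=64G means 65536 MB, which is more than a node declared with RealMemory=64000 can offer. The sketch below checks a request against a node's RealMemory before submission. mem_to_mb and fits_node are hypothetical helper names, and the conversion assumes integer values with an optional K, M, G, or T suffix (megabytes when no suffix is given).

```shell
# mem_to_mb (hypothetical helper): convert a --mem value such as 64G,
# 60000M, or 60000 to megabytes. Integer values only.
mem_to_mb() {
  local spec="$1" num unit
  num=${spec%[KkMmGgTt]}      # strip one trailing unit letter, if present
  unit=${spec#"$num"}         # whatever was stripped is the unit
  case "$unit" in
    ""|M|m) echo "$num" ;;
    K|k)    echo $(( num / 1024 )) ;;
    G|g)    echo $(( num * 1024 )) ;;
    T|t)    echo $(( num * 1024 * 1024 )) ;;
  esac
}

# fits_node (hypothetical helper): succeed when the request ($1) fits
# within the node's RealMemory in MB ($2).
fits_node() {
  [ "$(mem_to_mb "$1")" -le "$2" ]
}

# Example (not executed here):
#   fits_node 64G 64000 || echo "request exceeds RealMemory"
```

Leaving some headroom below RealMemory is a reasonable habit, since the value should describe memory actually available to jobs on the node.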

Common Slurm Failures

Jobs failing to start typically trace to misconfigured partition definitions or resource conflicts:

# Diagnose job failure
scontrol show job $JOB_ID

# Typical fix: partition not defined
cat <<EOF >> /etc/slurm/slurm.conf
PartitionName=compute Nodes=localhost[1-2] Default=YES MaxTime=24:00:00 State=UP
EOF
systemctl restart slurmctld
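
The output of scontrol show job is long, and the two fields that usually say which case applies are JobState and Reason. The sketch below extracts a single field from that output. job_field is a hypothetical helper name, and it assumes the space-separated Key=Value layout that scontrol prints.

```shell
# job_field (hypothetical helper): print the value of the first "KEY=value"
# pair found in `scontrol show job` text on stdin; return 1 if KEY is absent.
job_field() {
  local key="$1" line rest
  while read -r line; do
    case " $line" in
      *" $key="*)
        rest=" $line"
        rest=${rest#*" $key="}         # drop everything up to and including KEY=
        printf '%s\n' "${rest%% *}"    # keep the value up to the next space
        return 0
        ;;
    esac
  done
  return 1
}

# Typical use on a live cluster (not executed here):
#   scontrol show job "$JOB_ID" | job_field Reason
```

Matching on a leading space keeps a short key such as State from matching inside JobState.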

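Accounting database connection failures show up as slurmdbd failing to start, or as slurmctld logging that it cannot reach the accounting storage, often with little detail. Verify that StorageUser and StoragePass in slurmdbd.conf match the database grants; the database credentials live in that file, not in slurm.conf.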