RUNLOCALAI

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo

© 2026 runlocalai.co · Independently operated
Local AI Clusters

11. Slurm for AI

Chapter 11 of 18 · 20 min
KEY INSIGHT

Slurm provides battle-tested scheduling for long-running GPU workloads, with fair-share scheduling, gang scheduling, and backfill optimization. The complexity lies in configuring the accounting database, defining partitions, and telling `slurmctld` (controller) failure modes apart from `slurmd` (compute daemon) failure modes.

Workload managers like Slurm provide batch scheduling, job queuing, and resource allocation designed for long-running batch jobs, features the default Kubernetes scheduler does not offer out of the box. That is why they are standard on GPU clusters running AI training workloads.

Installing Slurm on Ubuntu

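Slurm needs a controller daemon (slurmctld, which also runs the scheduler) and a compute daemon (slurmd) on every node. Job accounting additionally needs the slurmdbd daemon, backed by a MySQL-compatible database such as MariaDB:

apt-get install -y slurmd slurmctld slurmdbd mariadb-server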

# Configure MariaDB for Slurm accounting (replace slurm_password with a real secret)
mysql -e "CREATE DATABASE slurm_acct_db;"
mysql -e "CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'slurm_password';"
mysql -e "GRANT ALL PRIVILEGES ON slurm_acct_db.* TO 'slurm'@'localhost';"
mysql -e "FLUSH PRIVILEGES;"

The slurm.conf file controls cluster behavior. A minimal configuration for a small local AI cluster with two node entries, localhost1 and localhost2:

ClusterName=localai
# SlurmctldHost replaces the deprecated ControlMachine
SlurmctldHost=localhost
SlurmUser=root
SlurmctldPort=6817
SlurmdPort=6818
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
TmpFS=/tmp
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
MpiDefault=none
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
GresTypes=gpu,tmpdisk
# A node definition must sit on a single line.
# CPUs = Sockets x CoresPerSocket x ThreadsPerCore = 2 x 8 x 2 = 32; RealMemory is in MB.
NodeName=localhost[1-2] CPUs=32 RealMemory=64000 Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 Gres=gpu:2 State=UNKNOWN

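The bracket notation NodeName=localhost[1-2] expands to two nodes, localhost1 and localhost2, and both names must resolve to an address. Each node runs its own slurmd daemon. Running two of them on one machine requires a Slurm build with multiple-slurmd support, so with stock packages give each physical host its own NodeName.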
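The Gres=gpu:2 entry only tells the controller how many GPUs to expect. Each slurmd also needs a gres.conf that maps that count to device files. The sketch below assumes each node is a separate machine whose NVIDIA devices appear as /dev/nvidia0 and /dev/nvidia1. The helper name gres_line is hypothetical, and the device paths should be adjusted to your hardware.

```shell
#!/bin/bash
# Hypothetical helper: print a gres.conf line for COUNT GPUs on the given nodes.
# Assumes NVIDIA device files numbered from /dev/nvidia0 upward.
gres_line() {
    local nodes="$1" count="$2"
    if [ "$count" -eq 1 ]; then
        echo "NodeName=${nodes} Name=gpu File=/dev/nvidia0"
    else
        echo "NodeName=${nodes} Name=gpu File=/dev/nvidia[0-$((count - 1))]"
    fi
}

# Matches Gres=gpu:2 in the slurm.conf above.
gres_line 'localhost[1-2]' 2

# To install it (the path is an assumption; some packages use /etc/slurm-llnl):
#   gres_line 'localhost[1-2]' 2 > /etc/slurm/gres.conf
```

Keeping the GPU count in one place makes it harder for slurm.conf and gres.conf to drift apart, which is a common reason a node fails to register.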

Starting and Verifying Services

systemctl enable slurmctld slurmd
systemctl start slurmctld
systemctl start slurmd

# Check cluster state
sinfo
# Nodes may be reported as down at first; return them to service
scontrol update NodeName='localhost[1-2]' State=RESUME

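Nodes defined with State=UNKNOWN move to idle once their slurmd registers with the controller. A node that stays down usually failed to register (slurmd is not running, or its detected hardware does not match slurm.conf) or was marked down earlier. With the default ReturnToService=0, it remains down until an administrator resumes it.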
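To see which nodes still need attention, the sinfo output can be filtered by state. This is a minimal sketch: unavailable_nodes is a hypothetical helper, and the sinfo calls are guarded so the snippet does nothing on a machine without the Slurm client tools.

```shell
#!/bin/bash
# Hypothetical helper: read "name state" pairs on stdin and print the nodes
# whose state starts with "down" or "drain" (sinfo may append flags such as *).
unavailable_nodes() {
    awk 'tolower($2) ~ /^(down|drain)/ { print $1 }'
}

# Only query the cluster when the Slurm client tools are installed.
if command -v sinfo >/dev/null 2>&1; then
    # -h drops the header, -N prints one line per node,
    # %N is the node name and %T the extended node state.
    sinfo -h -N -o '%N %T' | unavailable_nodes
    # -R lists the recorded reason for each down or drained node.
    sinfo -R
fi
```

The reason column from sinfo -R is usually the quickest pointer to whether the fix is a resume or a configuration change.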

Submitting GPU Jobs

Slurm uses GRES (generic resource) specifications for GPU allocation:

#!/bin/bash
#SBATCH --job-name=llama-finetune
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
# 60G stays below the node's RealMemory=64000 (MB); 64G (65536 MB) would exceed it
#SBATCH --mem=60G
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --output=%j.out

module load cuda/12.1
srun python train.py --config configs/llama3-8b.yaml

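Submit with sbatch script.sh and monitor with squeue. A job that cannot get its GPUs stays in the pending state, and squeue shows why in the NODELIST(REASON) column (for example Resources or ReqNodeNotAvail). A request that no node could ever satisfy is rejected by sbatch at submission.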
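Submission and monitoring can be chained so the job ID never has to be copied by hand. The sketch below assumes the batch script is saved as script.sh in the current directory. The helper job_id_from_parsable is hypothetical, and the Slurm calls are skipped when sbatch is not installed.

```shell
#!/bin/bash
# Hypothetical helper: sbatch --parsable prints "jobid" or "jobid;cluster";
# keep only the job ID.
job_id_from_parsable() {
    echo "${1%%;*}"
}

if command -v sbatch >/dev/null 2>&1 && [ -f script.sh ]; then
    job_id="$(job_id_from_parsable "$(sbatch --parsable script.sh)")"
    # %T is the job state, %r the reason a pending job is still waiting.
    squeue -j "$job_id" -h -o '%T %r'
fi
```

Inside a running job, Slurm exports CUDA_VISIBLE_DEVICES for GPUs allocated through GRES, so printing that variable at the top of the batch script is a cheap way to confirm the allocation.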

Common Slurm Failures

Jobs failing to start typically trace to misconfigured partition definitions or resource conflicts:

# Diagnose job failure
scontrol show job $JOB_ID

# Typical fix: partition not defined
cat <<EOF >> /etc/slurm/slurm.conf
PartitionName=compute Nodes=localhost[1-2] Default=YES MaxTime=24:00:00 State=UP
EOF
systemctl restart slurmctld
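Appending with a heredoc adds a duplicate partition line every time the fix is re-run. A guarded append avoids that. In this sketch, ensure_partition is a hypothetical helper, and it is demonstrated on a scratch file rather than the live configuration.

```shell
#!/bin/bash
# Hypothetical helper: append a partition line only if that partition name
# is not already defined in the given config file.
ensure_partition() {
    local conf="$1" name="$2" line="$3"
    if grep -q "^PartitionName=${name}[[:space:]]" "$conf"; then
        echo "partition ${name} already defined"
    else
        echo "$line" >> "$conf"
        echo "partition ${name} added"
    fi
}

# Demonstrate on a scratch file instead of /etc/slurm/slurm.conf.
conf="$(mktemp)"
line='PartitionName=compute Nodes=localhost[1-2] Default=YES MaxTime=24:00:00 State=UP'
ensure_partition "$conf" compute "$line"
ensure_partition "$conf" compute "$line"
rm -f "$conf"
```

The second call leaves the file unchanged, so the same script can run from provisioning tools without accumulating duplicate definitions.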

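Accounting database connection failures surface as cryptic errors or repeated restarts of the slurmctld and slurmdbd services. Verify that the credentials in the Slurm accounting configuration (StorageUser and StoragePass in slurmdbd.conf) match the database grants.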
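One way to check the credentials is to read them from the accounting daemon's configuration and try them against MariaDB directly. This sketch assumes the credentials live in slurmdbd.conf as StorageUser and StoragePass, that the file sits at /etc/slurm/slurmdbd.conf, and that the database is named slurm_acct_db as created earlier. The helper conf_value is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the value of KEY from a Slurm-style KEY=VALUE
# config file, ignoring comment lines. Matching is case-sensitive here.
conf_value() {
    local key="$1" file="$2"
    grep -v '^[[:space:]]*#' "$file" | sed -n "s/^${key}=//p" | head -n 1
}

# The path is an assumption; adjust to where your package installs the file.
dbd_conf=/etc/slurm/slurmdbd.conf
if [ -r "$dbd_conf" ] && command -v mysql >/dev/null 2>&1; then
    user="$(conf_value StorageUser "$dbd_conf")"
    pass="$(conf_value StoragePass "$dbd_conf")"
    # A successful SELECT shows the stored credentials match the grants.
    mysql -u "$user" -p"$pass" -e 'SELECT 1;' slurm_acct_db
fi
```

If the SELECT fails with an access-denied error, the mismatch is on the database side, not in the Slurm daemons.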

EXERCISE

Install Slurm in a single-node configuration, define a two-node cluster with different GPU counts, submit a batch job requesting a specific GPU, and verify the allocation with scontrol show job.
