Heavy Watal

Slurm

Migrated from TORQUE and OpenPBS.

Installation

Common to both controller and compute nodes:

sudo apt update && apt list --upgradable
sudo apt upgrade
sudo apt install munge slurmd

Set NTP server and time zone:

sudo timedatectl set-timezone Asia/Tokyo
sudo vim /etc/systemd/timesyncd.conf
[Time]
NTP=YOUR_NTP_SERVER_HERE
sudo systemctl restart systemd-timesyncd
systemctl status systemd-timesyncd
timedatectl timesync-status

Controller node

sudo apt install slurm-wlm slurmctld

Create /etc/munge/munge.key:

sudo -u munge /usr/sbin/mungekey -v

Create /etc/slurm/cgroup.conf. It can be empty, but you may want to try resource constraints:

ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes

Use /usr/share/doc/slurmctld/slurm-wlm-configurator.easy.html to generate an almost-default /etc/slurm/slurm.conf.
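For orientation, the generated file might look roughly like the sketch below. The cluster name mycluster, the host names control01 and compute01, the partition name main, and the CPU and memory figures are placeholders rather than values from this setup; the state and log paths follow the directories used in these notes. The script writes to a scratch file so the result can be reviewed before anything under /etc/slurm is touched.

```shell
#!/bin/sh
# Write a minimal slurm.conf sketch to a scratch file for review.
# All host names and hardware figures below are placeholders (assumptions).
conf="${TMPDIR:-/tmp}/slurm.conf.sketch"

cat > "$conf" <<'EOF'
ClusterName=mycluster
SlurmctldHost=control01
SlurmUser=slurm
AuthType=auth/munge
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
StateSaveLocation=/var/lib/slurm/slurmctld
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=compute01 CPUs=8 RealMemory=16000 State=UNKNOWN
PartitionName=main Nodes=compute01 Default=YES MaxTime=INFINITE State=UP
EOF

echo "wrote $conf"
```

Running slurmd -C on a node prints its actual CPUs and RealMemory in slurm.conf syntax, which can replace the placeholder NodeName line.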

Prepare directories for log files and state files:

sudo mkdir -p /var/lib/slurm/slurmctld
sudo mkdir -p /var/lib/slurm/slurmd
sudo mkdir -p /var/log/slurm
sudo chown -R slurm:slurm /var/lib/slurm
sudo chown -R slurm:slurm /var/log/slurm

Set permissions for configuration files:

sudo chown -R slurm:slurm /etc/slurm
sudo chmod 644 /etc/slurm/slurm.conf
sudo chmod 644 /etc/slurm/cgroup.conf

Enable and start services:

sudo systemctl enable --now munge slurmd slurmctld
systemctl status munge slurmd slurmctld

Copy munge.key, slurm.conf, and cgroup.conf to some NFS-shared directory, e.g., /home/admin/etc/slurm/, to distribute them to compute nodes later.

Compute node

Prepare directories for log files and state files:

sudo mkdir -p /var/lib/slurm/slurmd
sudo mkdir -p /var/log/slurm
sudo chown -R slurm:slurm /var/lib/slurm
sudo chown -R slurm:slurm /var/log/slurm

Copy munge.key, slurm.conf, and cgroup.conf from the controller node:

cd /home/admin/etc/slurm/

sudo cp munge.key /etc/munge/
sudo chmod 400 /etc/munge/munge.key
sudo chown munge:munge /etc/munge/munge.key

sudo cp slurm.conf /etc/slurm/
sudo cp cgroup.conf /etc/slurm/
sudo chmod 644 /etc/slurm/*.conf

Enable and start services:

sudo systemctl enable --now munge slurmd
systemctl status munge slurmd
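MUNGE only authenticates between nodes that hold the same key, so it can be worth confirming that the copied munge.key is byte-identical to the shared one. The following is a minimal sketch; the helper name same_digest is made up for this note, and sha256sum from coreutils is assumed to be available.

```shell
#!/bin/sh
# same_digest FILE1 FILE2
# Succeeds (exit 0) only if both files are readable and share a SHA-256 digest.
# Hypothetical helper for this note, not part of munge or Slurm.
same_digest() {
    a=$(sha256sum "$1" 2>/dev/null | cut -d ' ' -f 1)
    b=$(sha256sum "$2" 2>/dev/null | cut -d ' ' -f 1)
    # An unreadable file yields an empty digest, which counts as a mismatch.
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage on a compute node (as root, because munge.key is mode 400):
#   same_digest /home/admin/etc/slurm/munge.key /etc/munge/munge.key && echo "key matches"
```

For an end-to-end check, munge -n | unmunge verifies the local daemon, and munge -n | ssh compute01 unmunge run from the controller verifies that a credential created there decodes on the compute node.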

Check

sinfo
scontrol show nodes
systemctl status munge slurmd slurmctld
sudo less /var/log/slurm/slurmctld.log
sudo less /var/log/slurm/slurmd.log

If a node stays in the down or drain state after the services start, reload the configuration and resume the node manually:

sudo scontrol reconfigure
sudo scontrol update nodename=compute01 state=resume
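With more than a few nodes, the names to resume can be picked out of sinfo output instead of typed by hand. The sketch below assumes input lines of the form "NODE STATE", as printed by sinfo -N -h -o '%N %T'; the helper name nodes_to_resume is hypothetical.

```shell
#!/bin/sh
# nodes_to_resume: read "NODE STATE" lines on stdin and print the names of
# nodes whose state starts with "down" or "drain" (e.g. down*, drained).
# Hypothetical helper for this note, not a Slurm command.
nodes_to_resume() {
    while read -r node state; do
        case "$state" in
            down*|drain*) printf '%s\n' "$node" ;;
        esac
    done
}

# Usage on the controller (commented out so the sketch has no side effects):
#   sinfo -N -h -o '%N %T' | sort -u | nodes_to_resume |
#       while read -r n; do sudo scontrol update nodename="$n" state=resume; done
```

A node may have been drained on purpose, so it is safer to look at the reasons with sinfo -R before resuming everything the helper prints.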