
Migrate Slurm Server off of Cheaha Master

We have long wanted to move the Slurm server workload off the Cheaha master node for several reasons: performance, upgradeability, and recoverability.

We now have production OpenStack and Ceph S3 storage clusters that make this possible without outside resources such as VMware vCenter, if we choose to keep the deployment in-house.

  • Performance: The Cheaha master node, the MySQL database, and the Slurm server services can all be busy at the same time. They then contend for the same resources, which degrades performance for their clients (compute nodes, OOD, and others).
  • Upgradeability: Upgrading Slurm via BrightCM requires downtime, because we have to remove packages from the master node and reboot the compute nodes. Additionally, upgrading the BrightCM stack interrupts Slurm services.
  • Recoverability: Moving the Slurm services onto their own server, VM, or Docker container should make backups and restores much easier. It will also make it easier to stand up dev copies of the Slurm server.

Slurm Upgrade Instructions from Bright Computing

The following instructions should be helpful for upgrading from 18.08 to 20.02 and then to 20.11.

  • Let's first disable the workload manager
wlm-setup -d -w slurm
  • Remove the packages from the headnode and softwareimage
rpm -qa | grep slurm | xargs -p rpm -e
rpm -qa -r /cm/images/default-image |grep slurm |xargs -p rpm -r /cm/images/default-image -e
  • Install the new packages
yum install slurm20-client slurm20-slurmdbd slurm20-perlapi slurm20-contribs slurm20
yum install --installroot=/cm/images/default-image slurm20-client

  • Enable the WLM
wlm-setup -e -w slurm
  • Confirm the queues are retained and working
  • Repeat the same steps for the move to 20.11; only the package version changes.
wlm-setup -d -w slurm

rpm -qa | grep slurm | xargs -p rpm -e
rpm -qa -r /cm/images/default-image |grep slurm |xargs -p rpm -r /cm/images/default-image -e

yum install slurm20.11-client slurm20.11-slurmdbd slurm20.11-perlapi slurm20.11-contribs slurm20.11
yum install --installroot=/cm/images/default-image slurm20.11-client
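
Since the two passes differ only in the package prefix, they can be sketched as one parameterized function. This is a rough sketch, not part of the Bright instructions: the function name `print_upgrade_pass` and its parameters are made up for illustration, it only prints the commands rather than running them, and the final `wlm-setup -e` line in the second pass is an assumption carried over from the first pass.

```bash
#!/usr/bin/env bash
# Sketch only: prints the commands for one upgrade pass instead of running them.
# print_upgrade_pass and its arguments are illustrative names, not Bright tooling.

print_upgrade_pass() {
    local pkg_prefix="$1"                              # e.g. slurm20 or slurm20.11
    local image_root="${2:-/cm/images/default-image}"  # software image path used in the instructions

    cat <<EOF
wlm-setup -d -w slurm
rpm -qa | grep slurm | xargs -p rpm -e
rpm -qa -r ${image_root} | grep slurm | xargs -p rpm -r ${image_root} -e
yum install ${pkg_prefix}-client ${pkg_prefix}-slurmdbd ${pkg_prefix}-perlapi ${pkg_prefix}-contribs ${pkg_prefix}
yum install --installroot=${image_root} ${pkg_prefix}-client
wlm-setup -e -w slurm
EOF
}

# Print both passes in order: 18.08 to 20.02, then 20.02 to 20.11.
for prefix in slurm20 slurm20.11; do
    echo "# pass for ${prefix}"
    print_upgrade_pass "$prefix"
done
```

Printing the commands first makes it easy to review each pass, and to confirm the queues between passes, before anything is removed from the head node.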

The compute nodes will need rebooting to pick up the newly adjusted softwareimage.
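
After the reboots, one way to spot nodes that are still running the old slurmd is to compare the version each node reports against the target release. The helper below is a hedged sketch: `nodes_not_on_release` is a made-up name, and the suggested `sinfo` format (`%N` for the node name, `%v` for the slurmd version) should be checked against the man page of the installed release.

```bash
#!/usr/bin/env bash
# Sketch only: reads "node version" pairs on stdin and prints the nodes whose
# slurmd version is not the expected release (e.g. 20.11 or 20.11.x).

nodes_not_on_release() {
    local expected="$1"   # target release, e.g. 20.11
    awk -v want="$expected" '
        # Skip malformed lines; flag a node unless its version equals the
        # release or starts with "<release>.".
        NF >= 2 && $2 != want && index($2, want ".") != 1 { print $1 }
    '
}

# Assumed usage on a cluster (not run here):
#   sinfo -N -h -o "%N %v" | sort -u | nodes_not_on_release 20.11
```

An empty result would suggest that every node picked up the new software image; any node names printed are candidates for another reboot or a closer look.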

With regard to the FrozenFiles, these changes are unique to each environment and need to be analyzed individually to understand why each one was implemented.

Kind regards,

Sage
