Migrate Slurm Server off of Cheaha Master
Milestone ID: 137
We have long wanted to move the Slurm server workload off the Cheaha master node for several reasons: performance, upgradeability, and recoverability.
We now have production OpenStack and Ceph S3 storage clusters that make this possible without outside resources such as VMware vCenter, if we choose to keep the deployment in-house.
- Performance: The Cheaha master node, the MySQL database, and the Slurm server services can all be busy at the same time. They contend for the same resources, which degrades performance for clients (compute nodes, OOD, and others).
- Upgradeability: Upgrading Slurm via BrightCM requires downtime, because we have to remove packages from the master node and reboot the compute nodes. Additionally, upgrading the BrightCM stack interrupts Slurm services.
- Recoverability: Moving the Slurm services onto their own server, VM, or Docker container should make backup and restore much easier. It will also make it easier to stand up dev copies of the Slurm server.
Slurm Upgrade Instructions from Bright Computing
The following steps should be helpful in upgrading Slurm from 18.08 to 20.02 and then to 20.11:
- Let's first disable the workload manager
wlm-setup -d -w slurm
- Remove the packages from the headnode and softwareimage
rpm -qa | grep slurm | xargs -p rpm -e
rpm -qa -r /cm/images/default-image | grep slurm | xargs -p rpm -r /cm/images/default-image -e
- Install the new packages
yum install slurm20-client slurm20-slurmdbd slurm20-perlapi slurm20-contribs slurm20
yum install --installroot=/cm/images/default-image slurm20-client
- Enable the WLM
wlm-setup -e -w slurm
- Confirm the queues are retained and working
- The next steps, for the upgrade from 20.02 to 20.11, are the same with the exception of the version change.
wlm-setup -d -w slurm
rpm -qa | grep slurm | xargs -p rpm -e
rpm -qa -r /cm/images/default-image | grep slurm | xargs -p rpm -r /cm/images/default-image -e
yum install slurm20.11-client slurm20.11-slurmdbd slurm20.11-perlapi slurm20.11-contribs slurm20.11
yum install --installroot=/cm/images/default-image slurm20.11-client
The compute nodes will need rebooting to pick up the newly adjusted softwareimage.
With regard to the FrozenFiles, these changes are unique to each environment and will need to be analyzed individually to determine why each one was implemented.
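The two upgrade passes in the instructions differ only in the package prefix (slurm20, then slurm20.11), so the command list can be generated from one template. The following is a minimal dry-run sketch, not a Bright-provided tool: the helper name print_upgrade_steps is hypothetical, it only prints the commands for review and runs none of them, and it assumes the WLM is re-enabled with wlm-setup -e at the end of each pass, which the instructions state explicitly only for the first pass.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the upgrade commands for one Slurm package prefix.
# Nothing is executed here; review the output and run each command by hand.

# print_upgrade_steps is a hypothetical helper, not part of BrightCM.
#   $1  package prefix, e.g. slurm20 or slurm20.11
#   $2  software image root (defaults to the path used in the instructions)
print_upgrade_steps() {
  local pkg="$1"
  local image="${2:-/cm/images/default-image}"

  if [ -z "$pkg" ]; then
    echo "usage: print_upgrade_steps <package-prefix> [image-root]" >&2
    return 1
  fi

  # Same order as the instructions: disable, remove, install, enable.
  # Re-enabling after every pass is an assumption (see the note above).
  cat <<EOF
wlm-setup -d -w slurm
rpm -qa | grep slurm | xargs -p rpm -e
rpm -qa -r ${image} | grep slurm | xargs -p rpm -r ${image} -e
yum install ${pkg}-client ${pkg}-slurmdbd ${pkg}-perlapi ${pkg}-contribs ${pkg}
yum install --installroot=${image} ${pkg}-client
wlm-setup -e -w slurm
EOF
}

# Print both passes: 18.08 to 20.02, then 20.02 to 20.11.
for prefix in slurm20 slurm20.11; do
  echo "# --- ${prefix} ---"
  print_upgrade_steps "$prefix"
done
```

Printing rather than executing keeps the interactive xargs -p confirmations in the operator's hands, and a second argument allows the same template to target a software image other than default-image.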