Improve SLURM scheduler performance
We need to ensure we can run the latest version of Slurm in a performant way that handles the job loads we are now seeing on the cluster.
At present we have a 20k job limit and a 10k job-array limit. We have observed that two 10k array jobs can effectively deny service to other users, preventing new submissions until the total job count drops below 20k.
We have also witnessed extremely sluggish slurmctld behavior when the regular (non-array) job load exceeds 10k, so the 20k limit is not practically usable, at least for non-array jobs.
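For reference, these limits map onto slurmctld settings roughly as sketched below; the values are assumptions based on the numbers above, and the live slurm.conf remains authoritative.

```
# Sketch of the limits described above -- values assumed from this note,
# check the deployed slurm.conf for the authoritative settings.

# Maximum number of jobs slurmctld keeps in its active database; once this
# is hit, new submissions are refused, which is the "denial of service"
# other users experience.
MaxJobCount=20000

# Maximum job array size (highest task index is MaxArraySize - 1),
# i.e. the 10k array limit.
MaxArraySize=10000
```

Raising MaxJobCount on its own would likely just move the problem, given the slurmctld sluggishness we already see above roughly 10k active regular jobs.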
Ideally we would migrate the operation of Slurm off the BCM masters and run it as a service on dedicated resources (VMs or hardware, depending on performance), potentially with dedicated nodes for slurmctld, slurmdbd, and mysql. For example, these could be VMs in OpenStack or hardware provisioned by MAAS.
This deployment approach would also let us track upstream Slurm releases more easily and remove our dependence on BCM for Slurm deployment.
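A rough sketch of what the externalized layout could look like in Slurm configuration terms follows; the hostnames (slurmctld01, slurmdbd01, mysql01) are placeholders for the dedicated VMs or machines, not existing hosts.

```
# slurm.conf (distributed to all nodes) -- point the cluster at a
# dedicated controller and accounting host instead of the BCM masters.
ClusterName=cluster
SlurmctldHost=slurmctld01            # dedicated controller VM/host (placeholder name)
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd01     # dedicated slurmdbd VM/host (placeholder name)

# slurmdbd.conf (on slurmdbd01) -- accounting daemon backed by a
# dedicated MySQL/MariaDB server.
DbdHost=slurmdbd01
StorageType=accounting_storage/mysql
StorageHost=mysql01                  # dedicated database server (placeholder name)
StorageUser=slurm
StoragePass=REDACTED
```

Splitting slurmctld, slurmdbd, and the database onto separate resources in this way is what lets us size each service independently and upgrade Slurm on our own schedule.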