sacctmgr configuration fails with slurm database unreachable message
During the initial vagrant up ohpc step, the sacctmgr configuration task fails with the following message:
ohpc: TASK [ohpc_install : load sacctmgr config] *************************************
ohpc: fatal: [ohpc]: FAILED! => {"changed": true, "cmd": ["sacctmgr", "-i", "load", "/etc/slurm/sacctmgr-heirarchy.cfg"], "delta": "0:00:00.009189", "end": "2019-03-18 18:47:12.880612", "msg": "non-zero return code", "rc": 1, "start": "2019-03-18 18:47:12.871423", "stderr": "sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to ohpc:7031: Connection refused\nsacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused\nsacctmgr: error: Problem talking to the database: Connection refused", "stderr_lines": ["sacctmgr: error: slurm_persist_conn_open_without_init: failed to open persistent connection to ohpc:7031: Connection refused", "sacctmgr: error: slurmdbd: Sending PersistInit msg: Connection refused", "sacctmgr: error: Problem talking to the database: Connection refused"], "stdout": "", "stdout_lines": []}
ohpc: to retry, use: --limit @/vagrant/CRI_XCBC/site.retry
ohpc:
ohpc: PLAY RECAP *********************************************************************
ohpc: ohpc : ok=38 changed=35 unreachable=0 failed=1
The SSH command responded with a non-zero exit status. Vagrant
assumes that this means the command failed. The output for this command
should be in the log above. Please read the output to determine what
went wrong.
The database did appear to start successfully, although with some warning messages:
[root@ohpc ~]# cat /var/log/slurmdbd.log
[2019-03-18T17:53:36.384] error: Database settings not recommended values: innodb_buffer_pool_size innodb_log_file_size innodb_lock_wait_timeout
[2019-03-18T17:53:36.604] converting QOS table
[2019-03-18T17:53:36.604] Conversion done: success!
[2019-03-18T17:53:36.606] error: chdir(/var/log): Permission denied
[2019-03-18T17:53:36.606] chdir to /var/tmp
[2019-03-18T17:53:36.607] slurmdbd version 18.08.6 started
And it does seem to be listening:
[root@ohpc ~]# lsof -i :7031
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
slurmdbd 9746 slurm 6u IPv4 43260 0t0 TCP *:iposplanet (LISTEN)
However, sinfo is unable to contact the Slurm controller:
[root@ohpc ~]# sinfo
slurm_load_partitions: Unable to contact slurm controller (connect failure)
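One way to narrow this down might be to check whether port 7031 accepts connections when addressed by the hostname ohpc (as sacctmgr does), rather than only confirming that slurmdbd is listening. The following is a minimal sketch, not part of the CRI_XCBC playbooks: the function name wait_for_port is hypothetical, and it assumes bash, since it relies on bash's /dev/tcp redirection.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the playbooks): poll a TCP port until it
# accepts a connection or the timeout (in seconds, default 30) expires.
wait_for_port() {
    local host=$1 port=$2 timeout=${3:-30}
    local deadline=$((SECONDS + timeout))
    while (( SECONDS < deadline )); do
        # /dev/tcp is a bash feature; the subshell closes the socket on exit.
        if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
            echo "open"
            return 0
        fi
        sleep 1
    done
    echo "closed"
    return 1
}
```

Run on the ohpc node, something like "wait_for_port ohpc 7031 60 && sacctmgr -i load /etc/slurm/sacctmgr-heirarchy.cfg" would retry the load only once the port answers. If the port opens after a delay, the failure is probably a timing issue between slurmdbd startup and the sacctmgr task. If it stays closed for ohpc but opens for 127.0.0.1, it may be worth checking how ohpc resolves (for example with getent hosts ohpc).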