Update uabrc_check_hw_context_switch_rate error handling
The underlying issue: nodes would drain with reason NHC: Script timed out while executing "uabrc_check_hw_context_switch_rate 300000 5m" when the Prometheus server was not responding. This change addresses that.
- Moved uabrc_hw.nhc to the scripts/ directory for better organization
- Updated uabrc_hw.nhc to better handle situations where the Prometheus server is not healthy/responsive
- Updated nhc.conf to add two new arguments passed to uabrc_check_hw_context_switch_rate
The Prometheus health check lives in the new uabrc_hw_prom_srv_health() function and queries the http://$PROMETHEUS_SRV:$PROMETHEUS_PORT/-/healthy endpoint. The function returns PROMETHEUS_IS_HEALTHY to uabrc_check_hw_context_switch_rate; a non-zero value means the health check failed. If PROMETHEUS_IS_HEALTHY is non-zero (-ne 0), uabrc_check_hw_context_switch_rate exits back to NHC by calling nhcmain_finish, which exits the loop without draining the node.
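The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in uabrc_hw.nhc: the function names prom_srv_health_sketch and check_context_switch_rate_sketch, the 5-second default timeout, and the echoed messages are all assumptions, and the real check calls nhcmain_finish where the sketch simply returns.

```shell
# Hypothetical sketch of the health gate; names and timeout are assumptions.

# Returns 0 when the /-/healthy endpoint answers successfully within the
# timeout, and curl's non-zero exit status otherwise.
prom_srv_health_sketch() {
    local srv="$1" port="$2" timeout="${3:-5}"
    # --fail:     exit non-zero on HTTP status >= 400
    # --max-time: bound the whole request so the check cannot hang
    curl --silent --fail --max-time "$timeout" --output /dev/null \
        "http://${srv}:${port}/-/healthy"
}

# Stand-in for the real check: skips the metric query when Prometheus
# is not healthy, so an unresponsive server does not fail the node.
check_context_switch_rate_sketch() {
    local threshold="$1" window="$2" srv="$3" port="$4"
    if ! prom_srv_health_sketch "$srv" "$port"; then
        # The real script hands control back to NHC via nhcmain_finish here.
        echo "prometheus unhealthy, skipping check"
        return 0
    fi
    echo "prometheus healthy, would query rate over $window (threshold $threshold)"
}
```

Bounding the request with --max-time is what keeps the check from running into NHC's own script timeout when the server does not respond.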
TODO: What happens when the server responds to the following with a message other than "Prometheus Server is Healthy."?
$ curl http://$PROMETHEUS_SRV:$PROMETHEUS_PORT/-/healthy
Prometheus Server is Healthy.
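One possible way to handle the TODO is to compare the response body against the expected message instead of relying on the HTTP status alone. The helper below is a hypothetical sketch: the name prom_body_is_healthy_sketch and the PROM_HEALTHY_MSG variable are assumptions, and the expected string is simply the one shown in the curl output above.

```shell
# Hypothetical helper; not part of the current uabrc_hw.nhc.
# Expected body, copied from the curl output shown above.
PROM_HEALTHY_MSG="Prometheus Server is Healthy."

# $1: response body as captured with $(curl ...); command substitution
#     already strips the trailing newline.
# Returns 0 only on an exact match with the expected message.
prom_body_is_healthy_sketch() {
    [ "$1" = "$PROM_HEALTHY_MSG" ]
}

# Example use (commented out so sourcing this file makes no network call):
#   body="$(curl --silent --max-time 5 "http://$PROMETHEUS_SRV:$PROMETHEUS_PORT/-/healthy")"
#   prom_body_is_healthy_sketch "$body" || echo "unexpected health response: $body"
```

An exact match is strict, so a change to the server's message text would be treated as unhealthy; whether that is the desired behavior is part of the open question.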
The updated script also takes two new arguments, PROMETHEUS_SRV and PROMETHEUS_PORT, which must be passed via nhc.conf:
* || uabrc_check_hw_context_switch_rate 300000 5m grafana.ops.rc.uab.edu 9090