Add Error Handling to uabrc_check_hw_context_switch_rate Curl Command
The existing check will undrain an unhealthy drained node in the case where curl
can't reach the Prometheus server / metrics are missing from the current interval (node_exporter not running?).
In this scenario, it will return 0
when it fails because the curl
output is piped to jq
which succeeds. Here's an example showing $?
is 0
on a failure
$ curl -fs --data-urlencode 'query=irate(node_context_switches_total{job="compute-node",name="c0236"}[1m])' http://nagios.rc.uab.edu:9090/api/v1/query | jq -r '.data.result[] | .value[1]'
$ echo $?
0
HW_CONTEXT_SWITCH_RATE
gets set to 0, which then causes the FAIL
code to be skipped, i.e. the node passed the health check, which ultimately causes NHC to undrain the node.
HW_CONTEXT_SWITCH_RATE=$(curl -fs --data-urlencode "query=irate(node_context_switches_total{job=\"compute-node\",name=\"$NODENAME\"}[$HW_CONTEXT_SWITCH_INTERVAL])" http://nagios.rc.uab.edu:9090/api/v1/query | jq -r '.data.result[] | .value[1]')
if [[ $((HW_CONTEXT_SWITCH_RATE)) -gt $CONTEXT_SWITCH_RATE_MAX ]]; then
die 1 "$FUNCNAME: Total Context Switches ($HW_CONTEXT_SWITCH_RATE) greater than maximum allowed ($CONTEXT_SWITCH_RATE_MAX)
return 1
fi