Add Error Handling to uabrc_check_hw_context_switch_rate Curl Command

The existing check will undrain an unhealthy, drained node when curl can't reach the Prometheus server or when metrics are missing from the current interval (possibly because node_exporter is not running).

In this scenario the command still returns 0, because the curl output is piped to jq and the pipeline's exit status is that of jq, which succeeds even when it receives no input. Here's an example showing that $? is 0 on a failure:

$ curl -fs --data-urlencode 'query=irate(node_context_switches_total{job="compute-node",name="c0236"}[1m])' http://nagios.rc.uab.edu:9090/api/v1/query | jq -r '.data.result[] | .value[1]'
$ echo $?
0
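
The same masking can be reproduced without the Prometheus server. This is a minimal sketch, assuming bash: `false` stands in for the failing curl and `cat` stands in for jq reading empty input. Neither stand-in is part of the actual check.

```shell
#!/usr/bin/env bash
# Stand-ins (assumptions): `false` plays the failing curl, `cat` plays jq.

# By default the pipeline's status is that of its last command (cat).
false | cat
default_status=$?

# With pipefail, a failure anywhere in the pipeline is reported.
set -o pipefail
false | cat
pipefail_status=$?
set +o pipefail

echo "default: $default_status, pipefail: $pipefail_status"
# prints "default: 0, pipefail: 1"
```

Enabling pipefail only catches the case where curl itself fails. It does not help when curl succeeds but the query returns no result for the node.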

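HW_CONTEXT_SWITCH_RATE ends up empty, and $((HW_CONTEXT_SWITCH_RATE)) evaluates an empty variable as 0. The comparison against CONTEXT_SWITCH_RATE_MAX is therefore false and the FAIL code is skipped, i.e. the node passes the health check, which ultimately causes NHC to undrain the node.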

HW_CONTEXT_SWITCH_RATE=$(curl -fs --data-urlencode "query=irate(node_context_switches_total{job=\"compute-node\",name=\"$NODENAME\"}[$HW_CONTEXT_SWITCH_INTERVAL])" http://nagios.rc.uab.edu:9090/api/v1/query | jq -r '.data.result[] | .value[1]')
if [[ $((HW_CONTEXT_SWITCH_RATE)) -gt $CONTEXT_SWITCH_RATE_MAX ]]; then
    die 1 "$FUNCNAME:  Total Context Switches ($HW_CONTEXT_SWITCH_RATE) greater than maximum allowed ($CONTEXT_SWITCH_RATE_MAX)"
    return 1
fi
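
One possible shape for the error handling is sketched below, assuming bash. The helper names `is_valid_rate` and `fetch_context_switch_rate` are hypothetical and do not exist in NHC or in the current check. The idea is to run curl on its own so its exit status is not masked by jq, then reject an empty or non-numeric result before any comparison happens.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of NHC): succeeds only when $1 is a
# non-empty plain number such as "1234" or "1234.56".
is_valid_rate() {
    local re='^[0-9]+([.][0-9]+)?$'
    [[ $1 =~ $re ]]
}

# Hypothetical wrapper: prints the rate on success, returns non-zero when
# curl fails, jq fails, or the result is missing or not a number.
# Arguments: node name, rate interval, Prometheus query URL.
fetch_context_switch_rate() {
    local node=$1 interval=$2 url=$3
    local response rate

    # curl runs outside a pipeline, so a connection or HTTP error is visible.
    response=$(curl -fs --data-urlencode "query=irate(node_context_switches_total{job=\"compute-node\",name=\"$node\"}[$interval])" "$url") || return 1

    # jq only parses a response that curl actually retrieved.
    rate=$(jq -r '.data.result[] | .value[1]' <<<"$response") || return 1

    # An empty result (no metrics in the interval) is treated as a failure.
    is_valid_rate "$rate" || return 1

    printf '%s\n' "$rate"
}

# Possible use inside the check (die is provided by NHC at run time):
#   if ! HW_CONTEXT_SWITCH_RATE=$(fetch_context_switch_rate "$NODENAME" \
#           "$HW_CONTEXT_SWITCH_INTERVAL" http://nagios.rc.uab.edu:9090/api/v1/query); then
#       die 1 "$FUNCNAME:  Unable to read context switch rate for $NODENAME"
#       return 1
#   fi
```

Whether a failed lookup should fail the check as shown in the comment, or only skip it so the node's drain state is left unchanged, is a policy decision this sketch does not settle.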