
Feat Add Handling of Invalid Metric Due to Prometheus Con Issue

Mike Hanby requested to merge feat-handle-invalid-metric-from-prometheus into main

GitLab Issue #4 (closed): Add Error Handling to uabrc_check_hw_context_switch_rate Curl Command

The existing check will undrain an unhealthy, drained node when curl can't reach the Prometheus server or when metrics are missing from the current interval (possibly because node_exporter is not running).

In this scenario the check returns 0 even though curl failed: the curl output is piped to jq, jq succeeds, and the pipeline reports the exit status of its last command. That 0 is passed back to NHC as success, which ultimately results in the node being undrained in Slurm (assuming no other check fails).
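The masking described above can be reproduced without a Prometheus server. The following is a minimal sketch, not the code from this merge request: `fake_curl_unreachable` is a hypothetical stand-in for the failing curl call, and `cat` stands in for jq so the example has no external dependencies. It also shows bash's `pipefail` option as one possible way to surface the failure.

```shell
#!/usr/bin/env bash
# Stand-in for a curl call that cannot reach Prometheus (hypothetical helper):
# prints nothing and exits 7, curl's "failed to connect" exit code.
fake_curl_unreachable() { return 7; }

# Default behaviour: the pipeline's status is that of its last command.
# `cat` stands in for jq here so the sketch has no external dependencies.
set +o pipefail
status_without_pipefail=0
fake_curl_unreachable | cat || status_without_pipefail=$?

# With pipefail, the status is that of the rightmost command that failed.
set -o pipefail
status_with_pipefail=0
fake_curl_unreachable | cat || status_with_pipefail=$?
set +o pipefail

echo "without pipefail: ${status_without_pipefail}"
echo "with pipefail:    ${status_with_pipefail}"
```

Without `pipefail` the recorded status is 0; with it, the status is 7. Note that `pipefail` only covers the unreachable-server case. A response that arrives but lacks the metric would still need its value validated.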

This is a first pass at fixing the issue; the method has known flaws.

The current fix jumps to the core NHC function nhcmain_finish, bypassing the code that would undrain the node and leaving the node in whatever drain state it is currently in.
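Before taking that exit path, the check has to decide whether the value it received is usable. The following is a rough sketch of such a validation step, not the code from this merge request. `is_valid_metric` and `classify_metric` are hypothetical names, and the sketch assumes the metric arrives as a plain non-negative decimal string (values in scientific notation would be rejected). In the real check, the "invalid" branch is where the jump to nhcmain_finish would go.

```shell
# Hypothetical helper (not part of NHC): succeeds only when $1 is a plain
# non-negative decimal such as 1234 or 1234.5.
is_valid_metric() {
    case "$1" in
        ''|*[!0-9.]*) return 1 ;;  # empty, or contains something other than digits and dots
        .|*.*.*)      return 1 ;;  # a lone dot, or more than one dot
        *)            return 0 ;;
    esac
}

# Hypothetical wrapper: reports how a check might treat the value.
classify_metric() {
    if is_valid_metric "$1"; then
        echo "valid"
    else
        echo "invalid"
    fi
}

classify_metric "1234.5"
classify_metric ""
classify_metric "null"
```

An empty string models curl returning nothing, and `null` models a jq lookup on a response that lacks the metric; both are classified as invalid, while `1234.5` is classified as valid.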

The downside is that it short-circuits any checks that were supposed to run after uabrc_check_hw_context_switch_rate. This can be mitigated by placing this check at the end of the nhc.conf file when running NHC in serial mode.

NHC also has a mode where it forks all of the checks; I suspect this fix will not work in that case.

Edited by Mike Hanby
