    Feat Add Handling of Invalid Metric Due to Prometheus Con Issue · 6c40d217
    Mike Hanby authored
    #4
    
    The existing check can undrain an unhealthy, drained node when curl can't reach the Prometheus server or when metrics are missing from the current interval (possibly because node_exporter is not running).
    
    In this scenario the check returns 0 even though it failed: the curl output is piped to jq, and the exit status of the pipeline is that of jq, which succeeds. NHC treats this as success, which ultimately results in the node being undrained in Slurm (assuming no other check failed).
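
    The masking described above can be reproduced without a Prometheus server. The following is a minimal bash sketch, not the code from this commit: `fake_curl` and `fake_jq` are hypothetical stand-ins for the real commands, and exit status 7 is used because that is what curl returns when it fails to connect.

    ```shell
    #!/usr/bin/env bash
    # Hypothetical stand-in for a curl call that cannot reach Prometheus.
    # Real curl exits with status 7 when it fails to connect to the host.
    fake_curl() { return 7; }

    # Hypothetical stand-in for jq: consumes stdin and succeeds, even on empty input.
    fake_jq() { cat >/dev/null; return 0; }

    # Default bash behavior: the pipeline's status is that of the last command,
    # so the failure of the first stage is hidden.
    fake_curl | fake_jq
    masked_status=$?

    # PIPESTATUS holds the exit status of every stage of the most recent pipeline.
    fake_curl | fake_jq
    stages=("${PIPESTATUS[@]}")

    # With pipefail set, the pipeline fails if any stage fails.
    set -o pipefail
    fake_curl | fake_jq
    pipefail_status=$?
    set +o pipefail

    echo "masked=$masked_status curl=${stages[0]} jq=${stages[1]} pipefail=$pipefail_status"
    ```

    Either `PIPESTATUS` or `set -o pipefail` would let a check notice that the fetch failed; which one fits best depends on how the check is structured.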
    
    This is a first pass at fixing the issue, and the method used has flaws.
    
    The current fix jumps to the core NHC function `nhcmain_finish`, bypassing the code that would undrain the node, thus leaving it in whatever drain/undrain state that it’s currently in.
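
    A rough sketch of that control flow follows. It is an illustration under stated assumptions, not the code in this commit: `metric_is_valid` and `handle_metric` are hypothetical names, and the validity test (non-empty and not the string `null`, which is what jq prints for a missing field) is an assumed criterion. `nhcmain_finish` is only called when it is defined, so the sketch also runs outside NHC.

    ```shell
    #!/usr/bin/env bash
    # Hypothetical helper: treat an empty value or jq's "null" as an invalid metric.
    metric_is_valid() {
        [[ -n "$1" && "$1" != "null" ]]
    }

    # Hypothetical handler showing where the jump to nhcmain_finish would happen.
    handle_metric() {
        local value="$1"
        if ! metric_is_valid "$value"; then
            echo "invalid metric, leaving node state unchanged"
            # Inside NHC, jump to the core finish function as the commit
            # message describes, skipping the code that would undrain the node.
            # Outside NHC the function is not defined, so this is a no-op.
            if declare -F nhcmain_finish >/dev/null; then
                nhcmain_finish
            fi
            return 0
        fi
        echo "metric ok: $value"
    }

    handle_metric ""       # simulates curl failing to reach Prometheus
    handle_metric "null"   # simulates a metric missing from the interval
    handle_metric "1234.5" # simulates a usable sample
    ```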
    
    The downside is that it short-circuits any checks that were supposed to run after `uabrc_check_hw_context_switch_rate`. This can be mitigated by placing this check at the end of the `nhc.conf` file when running NHC in serial mode.
    
    NHC also has a mode where it forks all of the checks; I don't expect this fix to work in that case.