  1. Dec 11, 2024
  2. Aug 02, 2024
  3. Aug 17, 2023
      Merge branch 'feat-handle-invalid-metric-from-prometheus' into 'main' · 30691eb0
      Mike Hanby authored
      Feat Add Handling of Invalid Metric Due to Prometheus Con Issue
      
      See merge request !11
      Feat Add Handling of Invalid Metric Due to Prometheus Con Issue · 6c40d217
      Mike Hanby authored
      #4
      
      The existing check will undrain an unhealthy, drained node when curl can't reach the Prometheus server or when metrics are missing from the current interval (possibly because node_exporter is not running).
      
      In this scenario the check returns 0 even though the query failed: the curl output is piped to jq, jq succeeds, and the pipeline reports jq's exit status. That 0 is passed back to NHC as success, which ultimately results in the node being undrained in Slurm (assuming no other check failed).
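
The masking effect described above can be reproduced with a small sketch. This is not the code from this commit: `fetch_metric` and `parse_metric` are hypothetical stand-ins for the curl and jq calls, used so the example runs without a Prometheus server, and the exit status 7 is chosen only because curl uses it for a failed connection.

```shell
#!/bin/sh
# Hypothetical stand-ins so the sketch runs without a network or jq:
# fetch_metric plays the role of a failing curl call,
# parse_metric plays the role of jq succeeding on empty input.
fetch_metric() { return 7; }
parse_metric() { cat >/dev/null; }

# The pipeline's exit status is that of its last command, so the failure is hidden.
fetch_metric | parse_metric
pipeline_status=$?

# Running the fetch on its own keeps its exit status visible.
response=$(fetch_metric)
fetch_status=$?

echo "pipeline=$pipeline_status fetch=$fetch_status"
```

Capturing the fetch result separately, as in the second half, is one possible way for a check to notice the failure before any parsing happens. Whether the real check does this is not stated in the commit message.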
      
      This is a first pass at fixing the issue; the method has known flaws.
      
      The current fix jumps to the core NHC function `nhcmain_finish`, bypassing the code that would undrain the node, thus leaving it in whatever drain/undrain state that it’s currently in.
      
      The downside is that it short circuits any checks that were supposed to be run after `uabrc_check_hw_context_switch_rate`. This can be mitigated by placing this check at the end of the `nhc.conf` file when running NHC in serial mode.
      
      NHC also has a mode where it forks all of the checks; I don’t expect this fix to work in that mode.
  4. Aug 15, 2023
  5. Aug 14, 2023
  6. Aug 10, 2023
  7. Aug 08, 2023
  8. Jul 10, 2023
  9. Jun 21, 2023
  10. Jun 20, 2023
  11. Jun 13, 2023
  12. Jun 12, 2023