- Dec 11, 2024
-
-
Mike Hanby authored
Revert memory change for c0101 c0102 See merge request !13
-
Mike Hanby authored
c0101 and c0102 (out of warranty) each previously lost a stick of memory (i.e. it was not visible to the OS). Both nodes have since returned to having all 256G visible. This change reverts the earlier change to nhc.conf so that `check_hw_physmem` expects 256gb again, like the other P100 nodes (c0097 .. c0114).
-
- Aug 02, 2024
-
-
Mike Hanby authored
Update for GPFS5 nodes c0202..c0219 NFS GPFS4 mounts See merge request !12
-
Mike Hanby authored
- GPFS5 nodes have /data and /scratch mounted via NFS from GPFS4
- c0101 and c0102 - updated expected RAM
-
- Aug 17, 2023
-
-
Mike Hanby authored
Feat Add Handling of Invalid Metric Due to Prometheus Con Issue See merge request !11
-
Mike Hanby authored
#4 The existing check will undrain an unhealthy, drained node when curl can't reach the Prometheus server or when metrics are missing from the current interval (node_exporter not running?). In this scenario the check returns 0 even though it failed, because the curl output is piped to jq, which succeeds. That result is passed back to NHC as success, which ultimately results in the node being undrained in Slurm (assuming no other check failed).

This is a first pass at fixing the issue, and the method has flaws. The current fix jumps to the core NHC function `nhcmain_finish`, bypassing the code that would undrain the node and leaving it in whatever drain/undrain state it's currently in. The downside is that it short-circuits any checks that were supposed to run after `uabrc_check_hw_context_switch_rate`. This can be mitigated by placing this check at the end of the `nhc.conf` file when running NHC in serial mode. NHC also has a mode where it can fork all of the checks; I don't expect the fix to work in that case.
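The exit-status problem described above can be reproduced offline. The following is a minimal sketch, not the code from the merge request: `fetch_metric` and `filter_value` are hypothetical stand-ins for the real `curl -fs ...` and `jq -r ...` commands, and `is_valid_metric` is an illustrative helper whose regex only accepts plain or exponent-style numbers.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_metric and filter_value are hypothetical stand-ins for
# the real curl and jq commands, so this runs without a Prometheus server.

fetch_metric() { return 22; }   # simulates curl failing to reach Prometheus
filter_value() { cat; }         # simulates jq succeeding on empty input

# Default behaviour: a pipeline's status is that of its last command,
# so the upstream failure is hidden.
rate=$(fetch_metric | filter_value)
status_default=$?

# With pipefail, a failure anywhere in the pipeline is reported.
set -o pipefail
rate=$(fetch_metric | filter_value)
status_pipefail=$?
set +o pipefail

# Independent of exit status, treat anything that is not a number as invalid.
is_valid_metric() {
    local re='^[0-9]+([.][0-9]+)?([eE][-+]?[0-9]+)?$'
    [[ "$1" =~ $re ]]
}

echo "default=$status_default pipefail=$status_pipefail"
if ! is_valid_metric "$rate"; then
    echo "metric missing or invalid; node state should be left unchanged"
fi
```

Checking both the pipeline status and the shape of the returned value covers the two failure modes named above: curl being unable to reach the server, and the query succeeding but returning no sample for the current interval.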
-
- Aug 15, 2023
-
-
Mike Hanby authored
Rem Redundant Hostname Code in Script See merge request !10
-
Mike Hanby authored
NHC function [nhcmain_init_env()](https://github.com/mej/nhc/blob/master/nhc#L210) already initializes a variable containing the short hostname, `HOSTNAME_S`. The merge removes `NODENAME` and related code in favor of `$HOSTNAME_S`.
-
- Aug 14, 2023
-
-
Mike Hanby authored
Fix Defs for Large Mem Nodes c0136..139 See merge request !9
-
Mike Hanby authored
-
Mike Hanby authored
Test checkin of script See merge request !8
-
Mike Hanby authored
-
Mike Hanby authored
Mod context switch conf to use 5m interval See merge request !7
-
Mike Hanby authored
-
Mike Hanby authored
Update Context Switch Check to use Prometheus as Datasource See merge request !6
-
Mike Hanby authored
The previous method of using data returned by node_exporter was invalid, as it returns a total since boot. The new method queries Prometheus to get a rate change of the same metric:

```shell
HW_CONTEXT_SWITCH_RATE=$(curl -fs --data-urlencode "query=irate(node_context_switches_total{job=\"compute-node\",name=\"$NODENAME\"}[$HW_CONTEXT_SWITCH_INTERVAL])" \
    http://nagios.rc.uab.edu:9090/api/v1/query |
    jq -r '.data.result[] | .value[1]')
```
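To illustrate why a since-boot total is not usable as a threshold while a rate is, here is a small sketch of the arithmetic behind a per-second rate computed from the last two samples of a counter, which is what `irate()` does within its range window. The sample values and timestamps are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical counter samples: value of node_context_switches_total and the
# time (in seconds) at which each sample was scraped.
v1=1000000; t1=0
v2=1150000; t2=15

# The raw counter (v2) only grows with uptime, so it says little about the
# current load. The per-second rate between the two samples does.
rate=$(( (v2 - v1) / (t2 - t1) ))

echo "total since boot: $v2"
echo "context switches per second: $rate"
```

With these numbers the counter increased by 150000 over 15 seconds, giving 10000 context switches per second, regardless of how long the node has been up.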
-
Mike Hanby authored
Fix A100 Nodes were missing from CPU Core Count Check See merge request !5
-
Mike Hanby authored
-
- Aug 10, 2023
-
-
Mike Hanby authored
Add A100 and other Missing Nodes to NHC See merge request !4
-
Mike Hanby authored
Add c0097 - c0201 and c0236 - c0255 to NHC
Add `/dev/nvidia*` checks
-
- Aug 08, 2023
-
-
Mike Hanby authored
Init checkin of context switch check code See merge request !3
-
Mike Hanby authored
-
- Jul 10, 2023
-
-
Mike Hanby authored
-
- Jun 21, 2023
-
-
Mike Hanby authored
-
- Jun 20, 2023
-
-
Mike Hanby authored
-
- Jun 13, 2023
-
-
Mike Hanby authored
-
Mike Hanby authored
Add v2gpu compute nodes See merge request rc/nhc!2
-
- Jun 12, 2023
-
-
Mike Hanby authored
-
Mike Hanby authored
-
Mike Hanby authored
Init checkin of conf files See merge request rc/nhc!1
-
Mike Hanby authored
-
Mike Hanby authored
-