Commits · main · rc / rc-nhc

Dec 11, 2024

Merge branch 'revert-memory-change-for-c0101-c0102' into 'main' · ab0c08e8
Mike Hanby authored 3 months ago
```
Revert memory change for c0101 c0102

See merge request !13
```

Revert memory change for c0101 c0102 · fb25c495

Mike Hanby authored 3 months ago

c0101 and c0102 (out of warranty) previously lost a stick of memory
(i.e. it was not visible to the OS). Both nodes now have all 256 GB
visible again.

This change reverts the earlier nhc.conf edit so that check_hw_physmem
again expects 256gb, matching the other P100 nodes (c0097..c0114).


Aug 02, 2024
- Merge branch 'update-gpfs5-nodes-nfs-mounts' into 'main' · e30da5ee
  Mike Hanby authored 8 months ago
```
Update for GPFS5 nodes c0202..c0219 NFS GPFS4 mounts

See merge request !12
```
- Update for GPFS5 nodes c0202..c0219 NFS GPFS4 mounts · 33a10b22
  Mike Hanby authored 8 months ago
```
- GPFS5 nodes have /data and /scratch mounted via NFS from GPFS4
- c0101 and c0102 - updated expected RAM
```
Aug 17, 2023

Merge branch 'feat-handle-invalid-metric-from-prometheus' into 'main' · 30691eb0
Mike Hanby authored 1 year ago
```
Feat Add Handling of Invalid Metric Due to Prometheus Con Issue

See merge request !11
```

Feat Add Handling of Invalid Metric Due to Prometheus Con Issue · 6c40d217

Mike Hanby authored 1 year ago

#4

The existing check will undrain an unhealthy, drained node when curl cannot reach the Prometheus server or when metrics are missing from the current interval (possibly because node_exporter is not running).

In this scenario the check returns 0 even though the query failed: the curl output is piped to jq, jq succeeds, and the pipeline's exit status is that of its last command. NHC receives this as success, which ultimately results in the node being undrained in Slurm (assuming no other check failed).
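The failure mode described above can be reproduced without Prometheus. The sketch below is a minimal illustration, not the repository's code: `fetch_metric` and `parse_metric` are hypothetical stand-ins for the `curl` and `jq` stages, and the exit code 7 is an arbitrary choice. It shows that a pipeline reports the status of its last command unless `pipefail` is set.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the two pipeline stages (not from the repository).
fetch_metric() { return 7; }   # prints nothing and fails, like an unreachable server
parse_metric() { cat; }        # succeeds even on empty input

# Default behavior: the pipeline's status is that of the last command.
rate=$(fetch_metric | parse_metric)
status_default=$?

# With pipefail: a failure in any stage becomes the pipeline's status.
set -o pipefail
rate=$(fetch_metric | parse_metric)
status_pipefail=$?
set +o pipefail

echo "default=$status_default pipefail=$status_pipefail rate='$rate'"
```

With the default settings the failed fetch is reported as success and the captured value is empty, so testing for an empty value is one way to catch the failure even when the exit status does not.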

This is a first pass at fixing the issue; the method has known flaws.

The current fix jumps to the core NHC function `nhcmain_finish`, bypassing the code that would undrain the node and leaving the node in whatever drained or undrained state it is currently in.

The downside is that it short-circuits any checks that were supposed to run after `uabrc_check_hw_context_switch_rate`. This can be mitigated by placing this check at the end of the `nhc.conf` file when running NHC in serial mode.
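The guard described above might look roughly like the following. This is a sketch under stated assumptions, not the code from this commit: `metric_is_valid` and `check_rate` are hypothetical helpers, the metric is assumed to be a plain non-negative decimal number, and `nhcmain_finish` is replaced by a stub that only prints a message so the snippet runs outside NHC.

```shell
#!/usr/bin/env bash
# Stub standing in for NHC's nhcmain_finish so the sketch runs on its own.
nhcmain_finish() { echo "finish: leaving node state unchanged"; }

# Hypothetical helper: succeed only for a non-empty, non-negative decimal number.
metric_is_valid() {
    [[ $1 =~ ^[0-9]+(\.[0-9]+)?$ ]]
}

# Hypothetical wrapper: bail out early when the metric is unusable.
check_rate() {
    local rate=$1
    if ! metric_is_valid "$rate"; then
        nhcmain_finish
        return 0
    fi
    echo "rate ok: $rate"
}

check_rate ""        # empty result, e.g. Prometheus unreachable
check_rate "1523.4"  # plausible numeric result
```

Because the early exit skips every later check, the ordering of checks in `nhc.conf` matters with this approach.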

NHC also has a mode in which it forks all of the checks; I don’t expect this fix to work in that case.


Aug 15, 2023

Merge branch 'rem-redundant-hostname-code' into 'main' · 5f7593ba
Mike Hanby authored 1 year ago
```
Rem Redundant Hostname Code in Script

See merge request !10
```

Rem Redundant Hostname Code in Script · bb294cb1

Mike Hanby authored 1 year ago

The NHC function [nhcmain_init_env()](https://github.com/mej/nhc/blob/master/nhc#L210) already initializes a variable containing the short hostname, `HOSTNAME_S`.

This merge removes `NODENAME` and its related code in favor of `$HOSTNAME_S`.
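For context, a short hostname is the part of the name before the first dot. The snippet below sketches one common way to derive it in bash; it is an assumption for illustration and not necessarily how `nhcmain_init_env()` sets `HOSTNAME_S`. The fully qualified name is made up.

```shell
#!/usr/bin/env bash
# Hypothetical fully qualified name, used only for illustration.
fqdn="c0101.example.org"

# Strip everything from the first dot onward to get the short hostname.
short=${fqdn%%.*}

echo "$short"
```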


Aug 14, 2023
- Merge branch 'update-nhc-conf-fix-large-mem-nodes' into 'main' · 5f42a13e
  Mike Hanby authored 1 year ago
```
Fix Defs for Large Mem Nodes c0136..139

See merge request !9
```
- Fix Defs for Large Mem Nodes c0136..139 · 32dde549
  Mike Hanby authored 1 year ago
  
- Merge branch 'test-checkin-of-script' into 'main' · b9ac20aa
  Mike Hanby authored 1 year ago
```
Test checkin of script

See merge request !8
```
- Test checkin of script · 23188c05
  Mike Hanby authored 1 year ago
  
- Merge branch 'mod-interval-context-switches' into 'main' · 37b5e002
  Mike Hanby authored 1 year ago
```
Mod context switch conf to use 5m interval

See merge request !7
```
- Mod context switch conf to use 5m interval · e55537ee
  Mike Hanby authored 1 year ago
  
- Merge branch 'update-context-switch-check-now-poll-prometheus' into 'main' · b101b11b
  Mike Hanby authored 1 year ago
```
Update Context Switch Check to use Prometheus as Datasource

See merge request !6
```
- Update Context Switch Check to use Prometheus as Datasource · b7b596de
  Mike Hanby authored 1 year ago

  The previous method of using data returned by node_exporter was invalid, as it returns a total since boot.

  The new method queries Prometheus for the rate of change of the same metric:

```shell
# irate() computes a per-second rate from the last two samples in the window.
HW_CONTEXT_SWITCH_RATE=$(curl -fs \
  --data-urlencode "query=irate(node_context_switches_total{job=\"compute-node\",name=\"$NODENAME\"}[$HW_CONTEXT_SWITCH_INTERVAL])" \
  http://nagios.rc.uab.edu:9090/api/v1/query \
  | jq -r '.data.result[] | .value[1]')
```
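To illustrate why a total since boot is not directly usable, the sketch below derives a per-second rate from two counter samples, which is in essence what `irate()` does with the last two samples in its window. The sample values and timestamps are made up, and integer arithmetic is used for brevity.

```shell
#!/usr/bin/env bash
# Hypothetical counter samples: value and timestamp in seconds.
prev_value=1000000; prev_ts=100
last_value=1003000; last_ts=115

# Per-second rate = change in the counter / elapsed time.
rate=$(( (last_value - prev_value) / (last_ts - prev_ts) ))

echo "context switches per second: $rate"
```

The raw counter keeps growing for as long as the node is up, while the rate reflects only recent activity, so a threshold is meaningful only against the rate.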
- Merge branch 'fix-added-a100-nodes-to-cpu-core-check' into 'main' · 9d411ec3
  Mike Hanby authored 1 year ago
```
Fix A100 Nodes were missing from CPU Core Count Check

See merge request !5
```
- Fix A100 Nodes were missing from CPU Core Count Check · 08236223
  Mike Hanby authored 1 year ago
  
Aug 10, 2023
- Merge branch 'add-a100-and-missing-nodes-to-nhc-conf' into 'main' · b154d1f9
  Mike Hanby authored 1 year ago
```
Add A100 and other Missing Nodes to NHC

See merge request !4
```
- Add A100 and other Missing Nodes to NHC · 59c670b2
  Mike Hanby authored 1 year ago
```
Add c0097 - c0201 and c0236 - c0255 to NHC

Add `/dev/nvidia*` checks
```
Aug 08, 2023
- Merge branch 'init-checkin-context-switch-code' into 'main' · 19a265e1
  Mike Hanby authored 1 year ago
```
Init checkin of context switch check code

See merge request !3
```
- Init checkin of context switch check code · ccd84c44
  Mike Hanby authored 1 year ago
  
Jul 10, 2023
- Added .gitignore · d631ecfe
  Mike Hanby authored 1 year ago
  
Jun 21, 2023
- Update README.md · 25d51904
  Mike Hanby authored 1 year ago
  
Jun 20, 2023
- Update README.md · 2e4680bc
  Mike Hanby authored 1 year ago
  
Jun 13, 2023
- Fixed typo in README ansible command · 561de532
  Mike Hanby authored 1 year ago
  
- Merge branch 'add-v2gpu-nodes' into 'main' · b36f59db
  Mike Hanby authored 1 year ago
```
Add v2gpu compute nodes

See merge request rc/nhc!2
```
Jun 12, 2023
- Add v2gpu compute nodes · 2d99bf39
  Mike Hanby authored 1 year ago
  
- Updated Readme with Deploy Instructs · 104f8732
  Mike Hanby authored 1 year ago
  
- Merge branch 'init-conf-files-checkin' into 'main' · 449aa938
  Mike Hanby authored 1 year ago
```
Init checkin of conf files

See merge request rc/nhc!1
```
- Init checkin of conf files · 33a08575
  Mike Hanby authored 1 year ago
  
- Initial commit · 93c91180
  Mike Hanby authored 1 year ago
  