A100 `amperenodes` communication to User Community
TODO
For this issue
- ETA of release
- Review of hpc-announce email
Timeline
This is tentative and subject to change.
- 2023-07-27: Physical installation complete
- 2023-08-04: Slurm/node configuration complete (first pass)
- 2023-08-09: Single-GPU testing complete
- 2023-09-25: Necessary fixes complete
- 2023-09-25: Release
For A100s generally
- Plan for local NVMe drives
  - Mike proposed RAID 0 striping of the two drives (performance?)
  - Mount path would be /local
- Remaining tasks for A100s
  - Node definitions in slurm.conf
  - QoS definitions in slurm.conf
  - Consistent shell variable for /local (see the staging sketch after this list)
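A minimal sketch of how a job could use the planned node-local NVMe volume. The /local mount path comes from the plan above; the LOCAL_SCRATCH variable name is a placeholder, since the consistent shell variable has not been settled yet.

```python
"""Stage job data through the A100 nodes' local NVMe volume (planned mount: /local).

The LOCAL_SCRATCH variable name is hypothetical; swap in whatever shell
variable we standardize on.
"""
import os
import shutil
from pathlib import Path

# Fall back to the planned mount point if the (not yet defined) variable is unset.
LOCAL_SCRATCH = Path(os.environ.get("LOCAL_SCRATCH", "/local"))
JOB_DIR = LOCAL_SCRATCH / os.environ.get("SLURM_JOB_ID", "interactive")


def stage_in(src: str) -> Path:
    """Copy an input file from network storage onto node-local NVMe."""
    JOB_DIR.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(src, JOB_DIR))


def stage_out(result: Path, dest_dir: str) -> Path:
    """Copy a result back to network storage before the job ends."""
    return Path(shutil.copy(result, dest_dir))
```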
- Validating A100s (see the smoke-test sketch after this list)
  - Testing A100s ()
  - Performance comparison of A100s to P100s
  - Add amperenodes to live OOD: #461 (closed)
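A rough sketch of the kind of single-GPU smoke test and timing loop that could drive both the validation and the A100-vs-P100 comparison. It assumes a CUDA-enabled PyTorch install; the matrix size is arbitrary.

```python
"""Single-GPU smoke test and rough throughput timing, runnable on either an
A100 or a P100 node for comparison. Assumes a CUDA-enabled PyTorch install."""
import time

import torch

assert torch.cuda.is_available(), "no CUDA device visible to PyTorch"
device = torch.device("cuda:0")
print("GPU:", torch.cuda.get_device_name(device))
print("CUDA (PyTorch build):", torch.version.cuda)

# Rough matmul throughput; size is arbitrary and fits both GPUs' memory.
n = 8192
a = torch.randn(n, n, device=device)
b = torch.randn(n, n, device=device)

torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10):
    c = a @ b
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Each matmul is roughly 2 * n^3 floating point operations.
tflops = 10 * 2 * n**3 / elapsed / 1e12
print(f"~{tflops:.1f} TFLOP/s over 10 matmuls ({elapsed:.2f} s)")
```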
- CUDA: rc/cluster-software#103 (see the version-check sketch after this list)
  - At least CUDA/toolkit >= 11.8
  - Ideally >= 12.0
  - cuDNN compiled against CUDA/toolkit
  - tensorrt compiled against CUDA/toolkit [optional]
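A small sketch of checking an installed toolkit against the >= 11.8 floor; it assumes nvcc is on PATH (for example after loading a CUDA module).

```python
"""Check that the CUDA toolkit on the node meets the >= 11.8 requirement.
Assumes nvcc is on PATH, e.g. after loading a CUDA module."""
import re
import subprocess

MIN_VERSION = (11, 8)  # floor from rc/cluster-software#103; >= 12.0 preferred

out = subprocess.run(
    ["nvcc", "--version"], capture_output=True, text=True, check=True
).stdout
match = re.search(r"release (\d+)\.(\d+)", out)
if match is None:
    raise RuntimeError(f"could not parse nvcc output:\n{out}")

version = (int(match.group(1)), int(match.group(2)))
print("CUDA toolkit:", ".".join(map(str, version)))
assert version >= MIN_VERSION, f"CUDA {version} is older than required {MIN_VERSION}"
```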
Release coordination
- slurm.conf
  - Remove restrictions on access for amperenodes* partitions - Done: https://gitlab.rc.uab.edu/rc/rc-slurm/-/merge_requests/38
  - Add amperenodes* partitions to OOD Prod - #461 (closed)
- Communications Prepared
  - Shell MOTD
  - OOD MOTD
  - Docs Announcement
  - Docs Pages
  - HPC Announce
- Remove reservation in scontrol (see the verification sketch after this list)
- Release HPC Announce
- Notify and close relevant ServiceNow tickets
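A small sketch of verifying via scontrol that the hold on the amperenodes hardware is gone before the announcement goes out. The reservation name below is a placeholder, not the actual reservation.

```python
"""Pre-release sanity check: confirm the amperenodes reservation no longer
exists. The reservation name is a placeholder; substitute the real one."""
import subprocess

RESERVATION = "amperenodes-maint"  # placeholder name, not the actual reservation

result = subprocess.run(
    ["scontrol", "show", "reservation", RESERVATION],
    capture_output=True,
    text=True,
)

# scontrol exits non-zero or reports "not found" when the reservation is gone.
if result.returncode != 0 or "not found" in (result.stdout + result.stderr):
    print(f"Reservation {RESERVATION!r} is gone; nodes are open for release.")
else:
    print(result.stdout)
    raise SystemExit(f"Reservation {RESERVATION!r} still exists; remove it before release.")
```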
See the wiki page for current state of information to communicate.