A100 `amperenodes` communication to User Community

TODO

For this issue

  • ETA of release
  • Review of hpc-announce email

Timeline

This is tentative and subject to change.

  • 2023-07-27: Physical installation complete
  • 2023-08-04: Slurm/node configuration complete (first pass)
  • 2023-08-09: Single-GPU testing complete
  • 2023-09-25: Necessary fixes complete
  • 2023-09-25: Release

For A100s generally

  • Plan for local NVMe drives
    • Mike proposed RAID 0 striping of the two drives (performance?)
    • Mount path would be /local
  • Remaining tasks for A100s
    • Node definitions in slurm.conf
    • QoS definitions in slurm.conf
    • Consistent shell variable for /local
    • Validating A100s
    • Testing A100s
    • Performance comparison of A100s to P100s
    • Add amperenodes to live OOD: #461 (closed)
  • CUDA: rc/cluster-software#103
    • At least CUDA/toolkit >= 11.8
    • Ideally >= 12.0
    • cuDNN compiled against CUDA/toolkit
    • tensorrt compiled against CUDA/toolkit [optional]
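
The CUDA version requirements above can be checked mechanically when validating a toolkit module. The following is a minimal sketch, not the team's actual validation tooling: the function names and result labels are hypothetical, and only the 11.8 and 12.0 thresholds come from this issue.

```python
import re

# Thresholds taken from the list above; everything else here is an assumption.
MINIMUM = (11, 8)    # "At least CUDA/toolkit >= 11.8"
PREFERRED = (12, 0)  # "Ideally >= 12.0"


def parse_version(text):
    """Extract (major, minor) from a string such as '11.8' or 'CUDA/12.1.1'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def classify_toolkit(text):
    """Label a toolkit version against the minimum and preferred thresholds."""
    # Tuples compare numerically, so 11.10 correctly ranks above 11.8.
    version = parse_version(text)
    if version >= PREFERRED:
        return "preferred"
    if version >= MINIMUM:
        return "acceptable"
    return "too old"
```

Comparing integer tuples rather than strings avoids mis-ordering versions such as 11.10 and 11.8.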

Release coordination

  • slurm.conf - remove restrictions on access for amperenodes* partitions - Done https://gitlab.rc.uab.edu/rc/rc-slurm/-/merge_requests/38
  • Add amperenodes* partitions to OOD Prod - #461 (closed)
  • Communications Prepared
    • Shell MOTD
    • OOD MOTD
    • Docs Announcement
    • Docs Pages
    • HPC Announce
  • Remove reservation in scontrol
  • Release HPC Announce
  • Notify and close relevant ServiceNow tickets
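
For the reservation removal step, a small helper can build the `scontrol` command for review before anyone runs it. This is a sketch under assumptions: the reservation name is not recorded in this issue, so `example_reservation` below is a placeholder, and the helper only constructs the argument list without executing anything.

```python
import shlex


def delete_reservation_command(reservation_name):
    """Build the scontrol argument list that deletes one named reservation."""
    # Reject empty names or names with whitespace rather than emit a bad command.
    if not reservation_name or any(ch.isspace() for ch in reservation_name):
        raise ValueError(f"invalid reservation name: {reservation_name!r}")
    return ["scontrol", "delete", f"ReservationName={reservation_name}"]


if __name__ == "__main__":
    # 'example_reservation' is a hypothetical name used only for illustration.
    command = delete_reservation_command("example_reservation")
    # Print the command for review; pass the list to subprocess.run to execute.
    print(shlex.join(command))
```

Running `scontrol show reservation` first confirms the actual reservation name before deleting it.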

See the wiki page for the current state of the information to communicate.

Edited Sep 25, 2023 by William E Warriner