COSMA GH200

- Test bed
- In service
- Discipline-specific system for Astronomy and Cosmology
- Funded by STFC, DiRAC, ExCALIBUR
- 1 node, with 1 NVIDIA H100 NVL accelerator per node
- Benchmark, memory bandwidth (BabelStream): 3387 GB/s (array_size: 134217728, iterations: 100, precision: FP64)
- Manufactured by NVIDIA
- Scheduler: Direct SSH
COSMA (The Compute Optimised System for Modelling and Analysis) is a High Performance Computing facility hosted at Durham University, operated by the Institute for Computational Cosmology on behalf of DiRAC.
The H100 NVL (Hopper) node is a GPU testbed within COSMA.
| Node | RAM | CPU | Access |
|---|---|---|---|
| gn004 | 510 GB | 64 cores (Intel Xeon) | Direct SSH |
The H100 NVL is a high-memory variant of NVIDIA’s H100. It is optimised for LLM inference thanks to its high memory bandwidth, compute density, and energy efficiency.
Documentation
- https://cosma.readthedocs.io/en/latest/gpu.html#h100
- https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/h100/PB-11773-001_v01.pdf
Gaining access
Access requires a COSMA account, obtained via the DiRAC SAFE portal.
- Create a SAFE account with an institutional email.
- Upload an SSH public key on SAFE. If you do not have one, generate it with `ssh-keygen -t ed25519`.
- Request a login account. This requires selecting a project, either:
  - Project `do016` for NVIDIA GPU testbed access, or
  - a DiRAC project code for a given allocation (provided by a supervisor).
- Wait for the account to be approved by the project manager. Keep an eye on your email!
- Connect to COSMA via SSH: `ssh username@login8.cosma.dur.ac.uk` (note: on first login you will be asked to change the password provided in your email).
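If you connect regularly, an entry in your local `~/.ssh/config` can shorten the command. A sketch only; `your_safe_username` and the key path are placeholders for your own details:

```
Host cosma
    HostName login8.cosma.dur.ac.uk
    User your_safe_username
    IdentityFile ~/.ssh/id_ed25519
```

After this, `ssh cosma` connects directly.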
Visit https://cosma.readthedocs.io/en/latest/account.html for more details. Contact cosma-support@durham.ac.uk for any questions.
Usage
Connect directly to gn004 via SSH from a login node:

    ssh gn004
    nvidia-smi
    ./gpu_program_to_run
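As a concrete example, the workflow below compiles and runs a CUDA source file on gn004. It is a sketch only: `saxpy.cu` is a placeholder file name, and the guards make the script a no-op on machines without the module system or CUDA toolkit (H100 corresponds to compute capability 9.0, hence `sm_90`):

```shell
# Sketch of an interactive build-and-run session on gn004.
# `saxpy.cu` is a hypothetical source file; adjust to your own program.
if command -v module >/dev/null 2>&1; then
    module load nvhpc/25.11                  # provides the CUDA toolkit
fi
if command -v nvcc >/dev/null 2>&1; then
    nvcc -O3 -arch=sm_90 saxpy.cu -o saxpy   # sm_90 = Hopper / H100
    ./saxpy
else
    echo "nvcc not found: load nvhpc/25.11 on gn004 first"
fi
```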
Restrictions
- Shared resource. Before running large jobs, check whether others are using the node. Access is SSH only, so there is no SLURM queue.
- CUDA is available via `module load nvhpc/25.11`
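Since the node is shared, it is worth checking current GPU activity before starting a long run. A minimal sketch, guarded so it prints a note rather than failing on machines without the NVIDIA driver:

```shell
# Show current GPU processes before launching your own work.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi    # the process table at the bottom lists other users' jobs
else
    echo "nvidia-smi not found: run this check on gn004 itself"
fi
```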