• Test bed
  • In service
  • Discipline-specific system for Astronomy and Cosmology
  • Funded by STFC, DiRAC, ExCALIBUR
  • Partitions
    • 1 node, with 1 NVIDIA GH200 accelerator per node
      • Manufactured by NVIDIA
      • Scheduler: Direct SSH
    • 1 node, with 1 NVIDIA GH200 accelerator per node
      • Benchmarks
        • Memory bandwidth (BabelStream): 3500 GB/s
          • array_size: 134217728
          • iterations: 100
          • precision: FP64
      • Manufactured by NVIDIA
      • Scheduler: Slurm
  • Interconnects: NVLink-C2C

COSMA GH200

COSMA (The Compute Optimised System for Modelling and Analysis) is a High Performance Computing facility hosted at Durham University, operated by the Institute for Computational Cosmology on behalf of DiRAC.

The GH200 (Grace Hopper) nodes are GPU testbeds within COSMA.

Node    RAM              CPU                     Access
gn002   480GB (unified)  72 cores (ARM Grace)    Direct SSH
gn003   480GB (unified)  72 cores (ARM Grace)    Slurm (gracehopper)

The GH200 is NVIDIA’s “Grace Hopper Superchip”: it combines an NVIDIA Grace CPU and an NVIDIA H100 GPU over NVIDIA NVLink-C2C, giving a coherent CPU+GPU memory model.

As the Grace CPU is Arm-based (aarch64), x86-64 binaries will not run. Code must be compiled on the node itself.
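
A quick way to verify this from a shell on the node is sketched below; my_program is just a placeholder for any binary you are checking.

uname -m            # should report "aarch64" on the GH200 nodes
file ./my_program   # a binary reported as x86-64 here will not run on this node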

Documentation

Gaining access

Access requires a COSMA account, obtained via the DiRAC SAFE portal.

  1. Create a SAFE account with an institutional email.
  2. Upload an SSH public key to SAFE. If you do not have one, generate one with ssh-keygen -t ed25519.
  3. Request a login account. This requires selecting a project, either:
    • Project do016 for NVIDIA GPU testbed access.
    • A DiRAC project code for a given allocation (provided by a supervisor).
  4. Wait for the account to be approved by the project manager. Keep an eye on your email!
  5. Connect to COSMA via SSH: ssh username@login8.cosma.dur.ac.uk (Note: On first login you will be asked to change the password provided in your email)
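
For convenience, an entry in ~/.ssh/config avoids retyping the full hostname; the cosma alias, username and key path below are illustrative placeholders, not COSMA requirements.

cat >> ~/.ssh/config <<'EOF'
Host cosma
    HostName login8.cosma.dur.ac.uk
    User your_username
    IdentityFile ~/.ssh/id_ed25519
EOF

ssh cosma   # equivalent to ssh your_username@login8.cosma.dur.ac.uk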

Visit https://cosma.readthedocs.io/en/latest/account.html for more details. Contact cosma-support@durham.ac.uk with any questions.

Usage

For gn002, connect directly via SSH from a login node: ssh mad06
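
As a minimal interactive session (assuming the account set up above), something like the following should drop you onto the node:

ssh username@login8.cosma.dur.ac.uk   # reach a COSMA login node
ssh mad06                             # hop onto the GH200 node
nvidia-smi                            # confirm the GPU is visible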

For gn003, jobs are submitted via Slurm to the gracehopper partition:

#!/bin/bash
#SBATCH --partition=gracehopper   # GH200 partition (node gn003)
#SBATCH --account=do016           # or your DiRAC project code
#SBATCH --time=01:00:00           # requested wall time (maximum 3 days)

nvidia-smi            # confirm the GPU is visible
./gpu_program_to_run  # your own GPU executable
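
Assuming the script above is saved as, say, gh200_job.sh (the filename is arbitrary), it can be submitted and monitored with the standard Slurm commands:

sbatch gh200_job.sh     # submit to the gracehopper partition
squeue -u $USER         # check the job's position and state in the queue
cat slurm-<jobid>.out   # Slurm's default output file once the job has run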

Restrictions

  • Maximum wall time: 3 days
  • x86-64 binaries will not run; code must be compiled on the node itself
  • There is no system cmake on the nodes; install it with pip3 install --user cmake and add $HOME/.local/bin to your PATH
  • CUDA is available at /usr/local/cuda-13.0/bin/nvcc
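
Putting these restrictions together, a first compile-and-run on the node might look like the following sketch; saxpy.cu stands in for whatever CUDA source you are actually building.

# One-off setup: user-local cmake, since there is no system cmake on the nodes
pip3 install --user cmake
export PATH=$HOME/.local/bin:$PATH

# Compile on the node itself (aarch64) with the installed CUDA toolkit
/usr/local/cuda-13.0/bin/nvcc -O2 -o saxpy saxpy.cu
./saxpy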