• Test bed
  • In service
  • Discipline-specific system for Astronomy and Cosmology
  • Funded by STFC, DiRAC, ExCALIBUR
  • Partitions
    • 1 node, with 1 NVIDIA GH200 accelerator per node
      • Manufactured by NVIDIA
      • Scheduler: Direct SSH
    • 1 node, with 1 NVIDIA GH200 accelerator per node
      • Benchmarks
        • Memory bandwidth (BabelStream): 3500 GB/s
          • array_size: 134217728
          • iterations: 100
          • precision: FP64
      • Manufactured by NVIDIA
      • Scheduler: Slurm
  • Interconnects: NVLink-C2C

COSMA GH200

COSMA (The Compute Optimised System for Modelling and Analysis) is a High Performance Computing facility hosted at Durham University, operated by the Institute for Computational Cosmology on behalf of DiRAC.

The GH200 (Grace Hopper) nodes are GPU testbeds within COSMA.

Node    RAM              CPU                     Access
gn002   480GB (unified)  72 cores (ARM Grace)    Direct SSH
gn003   480GB (unified)  72 cores (ARM Grace)    Slurm (gracehopper)

The GH200 is NVIDIA’s “Grace Hopper Superchip”: it combines an NVIDIA Grace CPU and an NVIDIA H100 GPU over NVIDIA NVLink-C2C, giving a coherent CPU+GPU memory model.

As the Grace CPU is Arm-based (aarch64), x86-64 binaries will not run. Code must be compiled on the node itself.
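
A quick way to verify this from a shell on the node is sketched below; my_program is just a placeholder for any binary you are checking.

uname -m            # should report "aarch64" on the GH200 nodes
file ./my_program   # a binary reported as x86-64 here will not run on this node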

Documentation

Gaining access

Access requires a COSMA account, obtained via the DiRAC SAFE portal.

  1. Create a SAFE account with an institutional email.
  2. Upload an SSH public key to SAFE. If you do not have one, generate one with ssh-keygen -t ed25519.
  3. Request a login account. This requires selecting a project, either:
    • Project do016 for NVIDIA GPU testbed access.
    • A DiRAC project code for a given allocation (provided by a supervisor).
  4. Wait for the account to be approved by the project manager. Keep an eye on your email!
  5. Connect to COSMA via SSH: ssh username@login8.cosma.dur.ac.uk (Note: On first login you will be asked to change the password provided in your email)
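
For convenience, an entry in ~/.ssh/config avoids retyping the full hostname; the cosma alias, username and key path below are illustrative placeholders, not COSMA requirements.

cat >> ~/.ssh/config <<'EOF'
Host cosma
    HostName login8.cosma.dur.ac.uk
    User your_username
    IdentityFile ~/.ssh/id_ed25519
EOF

ssh cosma   # equivalent to ssh your_username@login8.cosma.dur.ac.uk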

Visit https://cosma.readthedocs.io/en/latest/account.html for more details. Contact cosma-support@durham.ac.uk with any questions.

Usage

For gn002, connect directly via SSH from a login node: ssh mad06
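
As a minimal interactive session (assuming the account set up above), something like the following should drop you onto the node:

ssh username@login8.cosma.dur.ac.uk   # reach a COSMA login node
ssh mad06                             # hop onto the GH200 node
nvidia-smi                            # confirm the GPU is visible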

For gn003, jobs are submitted via Slurm to the gracehopper partition:

#!/bin/bash
#SBATCH --partition=gracehopper   # GH200 partition (node gn003)
#SBATCH --account=do016           # or your DiRAC project code
#SBATCH --time=01:00:00           # requested wall time (maximum 3 days)

nvidia-smi            # confirm the GPU is visible
./gpu_program_to_run  # your own GPU executable
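
Assuming the script above is saved as, say, gh200_job.sh (the filename is arbitrary), it can be submitted and monitored with the standard Slurm commands:

sbatch gh200_job.sh     # submit to the gracehopper partition
squeue -u $USER         # check the job's position and state in the queue
cat slurm-<jobid>.out   # Slurm's default output file once the job has run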

Restrictions

  • Maximum wall time: 3 days
  • x86-64 binaries will not run; code must be compiled on the node itself
  • There is no system cmake on the nodes; install it with pip3 install --user cmake and add $HOME/.local/bin to your PATH
  • CUDA is available at /usr/local/cuda-13.0/bin/nvcc
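
Putting these restrictions together, a first compile-and-run on the node might look like the following sketch; saxpy.cu stands in for whatever CUDA source you are actually building.

# One-off setup: user-local cmake, since there is no system cmake on the nodes
pip3 install --user cmake
export PATH=$HOME/.local/bin:$PATH

# Compile on the node itself (aarch64) with the installed CUDA toolkit
/usr/local/cuda-13.0/bin/nvcc -O2 -o saxpy saxpy.cu
./saxpy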