• Test bed
  • In service
  • Discipline-specific system for Astronomy and Cosmology
  • Funded by STFC, DiRAC, ExCALIBUR
  • 1 node, with 8 AMD MI300X 192GB accelerators per node
  • Benchmarks (1)
    • Memory bandwidth (BabelStream): 4036 GB/s
      • array_size: 134217728
      • iterations: 100
      • precision: FP64
  • Manufactured by AMD
  • Scheduler: Slurm
  • Interconnects:

COSMA MI300X

COSMA (The Compute Optimised System for Modelling and Analysis) is a High Performance Computing facility hosted at Durham University, operated by the Institute for Computational Cosmology on behalf of DiRAC.

The MI300X node is a GPU testbed within COSMA.

Node    RAM    CPU                                   Access
ga007   2TB    96 cores (Intel Xeon Platinum 8468)   Slurm (mi300x)

The MI300X is AMD’s data center GPU, optimised for training and inference of large language models (LLMs) and generative AI.

The MI300X node hosts 8 GPUs, each with 192 GB of memory.

Documentation

Gaining access

Access requires a COSMA account, obtained via the DiRAC SAFE portal.

  1. Create a SAFE account with an institutional email.
  2. Upload an SSH public key on SAFE. If you do not have one, generate with ssh-keygen -t ed25519.
  3. Request a login account. This requires selecting a project, either:
    • Project do018 for AMD GPU testbed access.
    • A DiRAC project code for a given allocation (provided by a supervisor).
  4. Wait for the account to be approved by the project manager. Keep an eye on your email!
  5. Connect to COSMA via SSH: ssh username@login8.cosma.dur.ac.uk (Note: On first login you will be asked to change the password provided in your email)

Visit https://cosma.readthedocs.io/en/latest/account.html for more details. Contact cosma-support@durham.ac.uk for any questions.
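Steps 2 and 5 above can be sketched from a terminal. A minimal sketch (the key location and email comment are illustrative; ssh-keygen is standard OpenSSH):

keydir=$(mktemp -d)            # in practice use ~/.ssh/ and set a passphrase
keyfile="$keydir/id_ed25519"
ssh-keygen -t ed25519 -N "" -C "your.name@institution.ac.uk" -f "$keyfile"

# The public half is what you upload to SAFE
cat "${keyfile}.pub"

# Once the account is approved, connect (replace 'username'):
#   ssh -i "$keyfile" username@login8.cosma.dur.ac.uk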

Usage

Jobs are submitted via Slurm to the mi300x partition:

#!/bin/bash
#SBATCH --partition=mi300x
#SBATCH --account=do018
#SBATCH --time=01:00:00

rocm-smi
./gpu_program_to_run
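A sketch of the submit-and-monitor workflow around the script above (the file name mi300x_job.sh is arbitrary; sbatch, squeue, and scancel are standard Slurm commands, run on a COSMA login node):

# Save the batch script shown above to a file
cat > mi300x_job.sh <<'EOF'
#!/bin/bash
#SBATCH --partition=mi300x
#SBATCH --account=do018
#SBATCH --time=01:00:00

rocm-smi
./gpu_program_to_run
EOF

# Then, on a login node:
#   sbatch mi300x_job.sh     # submit; prints "Submitted batch job <jobid>"
#   squeue -u "$USER"        # monitor your queued/running jobs
#   scancel <jobid>          # cancel if needed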

For interactive access:

srun -p mi300x -A do018 -t 10 --pty /bin/bash

Restrictions

  • Nodes are non-exclusive by default (shared with other users). Use --exclusive if you require the entire node.
  • The AMD ROCm software stack is installed: ROCm 7.2.0 lives under /opt/rocm-7.2.0, with the HIP compiler at /opt/rocm-7.2.0/bin/hipcc.
  • CUDA code must be converted to HIP using the hipify tools (hipify-perl or hipify-clang) provided with ROCm.
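As a rough illustration of what the conversion does (file contents are hypothetical; the real tools ship with ROCm under /opt/rocm-7.2.0/bin/), hipify largely renames CUDA runtime calls to their HIP equivalents:

# A minimal CUDA fragment to translate
cat > saxpy.cu <<'EOF'
#include <cuda_runtime.h>
int main(void) {
    float *x;
    cudaMalloc((void **)&x, 1024 * sizeof(float));
    cudaFree(x);
    return 0;
}
EOF

# On the node the supported route is ROCm's tool, e.g.:
#   /opt/rocm-7.2.0/bin/hipify-perl saxpy.cu > saxpy.hip
# It systematically renames the API (cudaMalloc -> hipMalloc,
# cudaFree -> hipFree) and swaps the header for <hip/hip_runtime.h>.
# A crude stand-in for that rename, just to show the shape of the output:
sed -e 's/cuda_runtime.h/hip\/hip_runtime.h/' -e 's/cuda/hip/g' saxpy.cu > saxpy.hip

# Compile the result with hipcc (gfx942 is the MI300X architecture):
#   /opt/rocm-7.2.0/bin/hipcc --offload-arch=gfx942 saxpy.hip -o saxpy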