Context

Developing performant accelerated software requires a wide range of skills, many of which build on each other. The graph below outlines the skills and knowledge needed, and the dependencies between them, so that you can plan your learning journey.

This diagram is a work-in-progress; we welcome suggestions on skills missing from it, and how they connect.

    graph LR
    
    gpuArch1["`Basic GPU architecture`"]
    
    cuda1["`Introduction to CUDA`"]
    
    hip1["`Introduction to HIP`"]
    
    slp["`GPU programming using standard language methods`"]
    
    omp["`OpenMP for GPU programming`"]
    
    kokkos["`Kokkos for performance-portable GPU programming`"]
    
    sycl["`SYCL for performance-portable GPU programming`"]
    
    alpaka["`Alpaka for performance-portable GPU programming`"]
    
    multicuda["`Shared-memory multi-GPU programming with CUDA`"]
    
    mpi["`Multi-node GPU programming`"]
    
    cudaArch["`NVIDIA device architecture`"]
    
    amdArch["`AMD device architecture`"]
    
    intelArch["`Intel device architecture`"]
    
    multihip["`Shared-memory multi-GPU programming with HIP`"]
    
    
    
    
    
    gpuArch1 --> cuda1
    
    
    
    gpuArch1 --> hip1
    
    
    
    gpuArch1 --> slp
    
    
    
    gpuArch1 --> omp
    
    
    
    gpuArch1 --> kokkos
    
    
    
    gpuArch1 --> sycl
    
    
    
    gpuArch1 --> alpaka
    
    
    
    cuda1 --> multicuda
    
    
    
    multicuda --> mpi
    
    multihip --> mpi
    
    
    
    cuda1 --> cudaArch
    
    
    
    hip1 --> amdArch
    
    
    
    sycl --> intelArch
    
    
    
    hip1 --> multihip
    
    
    
    click gpuArch1 "#gpuArch1"
    
    click cuda1 "#cuda1"
    
    click hip1 "#hip1"
    
    click slp "#slp"
    
    click omp "#omp"
    
    click kokkos "#kokkos"
    
    click sycl "#sycl"
    
    click alpaka "#alpaka"
    
    click multicuda "#multicuda"
    
    click mpi "#mpi"
    
    click cudaArch "#cudaArch"
    
    click amdArch "#amdArch"
    
    click intelArch "#intelArch"
    
    click multihip "#multihip"
    

Basic GPU architecture

Before we can start programming GPU accelerators (of any brand), we need an understanding of some basic aspects of the hardware that affect how we need to structure our code and our thinking. Many introductions to GPU programming (for example, in CUDA and HIP) touch on this topic.

Back to the roadmap

Introduction to CUDA

CUDA is the C-based language developed by NVIDIA for programming their GPU accelerators. Introductory lessons assume no existing knowledge of GPU programming; they introduce writing and executing kernels, and managing data transfer between host and device.

Builds on:

  • Basic GPU architecture

Assumed prior (non-accelerator) knowledge:

  • C

Back to the roadmap

Introduction to HIP

HIP is the C-based language developed by AMD for programming their GPU accelerators. Introductory lessons assume no existing knowledge of GPU programming; they introduce writing and executing kernels, and managing data transfer between host and device.

Builds on:

  • Basic GPU architecture

Assumed prior (non-accelerator) knowledge:

  • C

Back to the roadmap

GPU programming using standard language methods

Many GPU accelerators allow programming using parallelism constructs available as a standard part of programming languages. While this might not be as performant as using dedicated GPU languages or libraries, the barrier to entry is much lower. Lessons here should cover the basics of any parallel constructs unfamiliar to typical language users, before showing how they extend to running on GPU accelerators.
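As a rough illustration of what such a construct can look like, the sketch below uses the C++17 parallel algorithms. Choosing C++ here is our assumption (this roadmap does not prescribe a language), and the function names `saxpy` and `dot` are hypothetical. The code contains nothing GPU-specific: the `std::execution::par_unseq` policy only states that the work may run in parallel, and it is up to the toolchain where it runs.

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <numeric>
#include <vector>

// y = a * x + y, expressed as a standard parallel algorithm.
// (Illustrative helper; the name is not taken from any lesson.)
void saxpy(double a, const std::vector<double>& x, std::vector<double>& y) {
    assert(x.size() == y.size());
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [a](double xi, double yi) { return a * xi + yi; });
}

// Dot product: multiply element pairs, then sum the products in parallel.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    return std::transform_reduce(std::execution::par_unseq,
                                 x.begin(), x.end(), y.begin(), 0.0);
}
```

Some compilers can offload calls like these to a GPU (for example, NVIDIA's `nvc++` with `-stdpar=gpu`). With other toolchains the same source typically compiles as ordinary C++17 and runs on the CPU, although some standard libraries need an extra threading backend linked in.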

Builds on:

  • Basic GPU architecture

Assumed prior (non-accelerator) knowledge:

  • C, Fortran, or Python

Back to the roadmap

OpenMP for GPU programming

OpenMP is a programming model for multithreaded parallel programming. Since version 4, it supports offloading to accelerators, including GPUs. Lessons here should assume no prior knowledge of OpenMP, introducing it from scratch but focusing on the aspects necessary for GPU offload.
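To give a flavour of the offload-specific parts, here is a minimal sketch of a loop offloaded with the `target` construct. It is written in C++ for consistency with other examples, the function name `saxpy_offload` is hypothetical, and it assumes a compiler with OpenMP offload support. Without that support the pragma is ignored or the region runs on the host, so the result is the same either way.

```cpp
#include <cassert>
#include <vector>

// y = a * x + y, offloaded to an accelerator when one is available.
// (Illustrative helper; the name is an assumption, not part of OpenMP.)
void saxpy_offload(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());
    const float* xp = x.data();  // map clauses operate on pointer array sections
    float* yp = y.data();

    // target:       execute the loop on the device (by default it usually
    //               falls back to the host if no device is available)
    // teams distribute parallel for: spread iterations over the device's threads
    // map(to:)      copy x from host to device before the loop
    // map(tofrom:)  copy y to the device before the loop and back afterwards
    #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The `map` clauses are where the host-to-device data movement described in the introductory GPU lessons becomes explicit. Enabling offload needs compiler-specific flags in addition to the usual OpenMP flag (such as `-fopenmp`), and these vary between toolchains.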

Builds on:

  • Basic GPU architecture

Assumed prior (non-accelerator) knowledge:

  • C or Fortran

Back to the roadmap

Kokkos for performance-portable GPU programming

Kokkos is a performance-portable programming interface and library for C++, allowing the same program to target many programming models. This means that the same program can run on CPUs and on many different types of GPU accelerator, without the need to write additional GPU-specific instructions or to manage communication with the GPU explicitly.

Builds on:

  • Basic GPU architecture

Back to the roadmap

SYCL for performance-portable GPU programming

SYCL is a performance-portable programming interface and library for C++, allowing the same program to target many programming models. This means that the same program can run on CPUs and on many different types of GPU accelerator, without the need to write additional GPU-specific instructions or to manage communication with the GPU explicitly.

(The benefit of developing skills in this direction is currently limited, for three reasons: Intel do not currently have a compute-focused GPU accelerator on the market; they have ended their contract with Codeplay, who developed SYCL support for non-Intel platforms; and non-Intel GPU accelerator manufacturers do not prioritise SYCL support.)

Builds on:

  • Basic GPU architecture

Assumed prior (non-accelerator) knowledge:

  • C++

Back to the roadmap

Alpaka for performance-portable GPU programming

Alpaka is a performance-portable programming interface and library for C++, allowing the same program to target many programming models. This means that the same program can run on CPUs and on many different types of GPU accelerator, without the need to write additional GPU-specific instructions or to manage communication with the GPU explicitly.

Builds on:

  • Basic GPU architecture

Assumed prior (non-accelerator) knowledge:

  • C++

Back to the roadmap

Shared-memory multi-GPU programming with CUDA

When programming for a single GPU accelerator, the only data transfers that need to be considered are from the CPU memory to the accelerator’s, and back again. Once a program needs to use multiple accelerators, one must also consider transfers from one accelerator to another. NVIDIA have technologies to make this much faster than transferring via the host; lessons here cover how to use CUDA to utilise this functionality.

Builds on:

  • Introduction to CUDA

Back to the roadmap

Multi-node GPU programming

When GPU-accelerated applications need to scale beyond a single node, maintaining a high level of performance becomes increasingly challenging, particularly when data must be transferred from an accelerator on one node to an accelerator on another. Lessons here cover the technologies available to do this efficiently, including accelerator-aware MPI, and how these interact with underlying libraries such as UCX and OpenFabrics.

Builds on:

  • Shared-memory multi-GPU programming with CUDA
  • Shared-memory multi-GPU programming with HIP

Assumed prior (non-accelerator) knowledge:

  • MPI

Back to the roadmap

NVIDIA device architecture

While a basic understanding of the structure of GPU accelerators and the CUDA language allows one to write accelerated code, a more detailed understanding of the underlying hardware enables more targeted optimisations to be made, which can have a significant effect on performance. Lessons in this area talk in more depth about the specifics of how NVIDIA GPU accelerators work internally, and how programs can take advantage of this.

Builds on:

  • Introduction to CUDA

Back to the roadmap

AMD device architecture

While a basic understanding of the structure of GPU accelerators and the HIP language allows one to write accelerated code, a more detailed understanding of the underlying hardware enables more targeted optimisations to be made, which can have a significant effect on performance. Lessons in this area talk in more depth about the specifics of how AMD GPU accelerators work internally, and how programs can take advantage of this.

Builds on:

  • Introduction to HIP

Back to the roadmap

Intel device architecture

While a basic understanding of the structure of GPU accelerators and the SYCL language allows one to write accelerated code, a more detailed understanding of the underlying hardware enables more targeted optimisations to be made, which can have a significant effect on performance. Lessons in this area talk in more depth about the specifics of how Intel GPU accelerators work internally, and how programs can take advantage of this.

Builds on:

  • SYCL for performance-portable GPU programming

Back to the roadmap

Shared-memory multi-GPU programming with HIP

When programming for a single GPU accelerator, the only data transfers that need to be considered are from the CPU memory to the accelerator’s, and back again. Once a program needs to use multiple accelerators, one must also consider transfers from one accelerator to another. AMD have technologies to make this much faster than transferring via the host; lessons here cover how to use HIP to utilise this functionality.

Builds on:

  • Introduction to HIP

Back to the roadmap