GPU Programming Knowledge Graph
Context
Developing performant accelerated software requires learning a wide range of skills, many of which build on each other. In this graph, we outline the skills and knowledge needed as building blocks, and the dependencies between them, so that you can plan your learning journey.
This diagram is a work-in-progress; we welcome suggestions on skills missing from it, and how they connect.
graph LR
gpuArch1["`Basic GPU architecture`"]
cuda1["`Introduction to CUDA`"]
hip1["`Introduction to HIP`"]
slp["`GPU programming using standard language methods`"]
omp["`OpenMP for GPU programming`"]
kokkos["`Kokkos for performance-portable GPU programming`"]
sycl["`SYCL for performance-portable GPU programming`"]
alpaka["`Alpaka for performance-portable GPU programming`"]
multicuda["`Shared-memory multi-GPU programming with CUDA`"]
mpi["`Multi-node GPU programming`"]
cudaArch["`NVIDIA device architecture`"]
amdArch["`AMD device architecture`"]
intelArch["`Intel device architecture`"]
multihip["`Shared-memory multi-GPU programming with HIP`"]
gpuArch1 --> cuda1
gpuArch1 --> hip1
gpuArch1 --> slp
gpuArch1 --> omp
gpuArch1 --> kokkos
gpuArch1 --> sycl
gpuArch1 --> alpaka
cuda1 --> multicuda
multicuda --> mpi
multihip --> mpi
cuda1 --> cudaArch
hip1 --> amdArch
sycl --> intelArch
hip1 --> multihip
click gpuArch1 "#gpuArch1"
click cuda1 "#cuda1"
click hip1 "#hip1"
click slp "#slp"
click omp "#omp"
click kokkos "#kokkos"
click sycl "#sycl"
click alpaka "#alpaka"
click multicuda "#multicuda"
click mpi "#mpi"
click cudaArch "#cudaArch"
click amdArch "#amdArch"
click intelArch "#intelArch"
click multihip "#multihip"
Basic GPU architecture
Before we can start programming GPU accelerators (of any brand), we need an understanding of some basic aspects of the hardware that affect how we need to structure our code and our thinking. Many introductions to GPU programming (for example, in CUDA and HIP) touch on this topic.
Introduction to CUDA
CUDA is the C-based language developed by NVIDIA for programming their GPU accelerators. Introductory lessons assume no existing knowledge of GPU programming; they introduce writing and executing kernels, and managing data transfer between host and device.
Builds on:
- Basic GPU architecture
Assumed prior (non-accelerator) knowledge:
- C
Introduction to HIP
HIP is the C-based language developed by AMD for programming their GPU accelerators. Introductory lessons assume no existing knowledge of GPU programming; they introduce writing and executing kernels, and managing data transfer between host and device.
Builds on:
- Basic GPU architecture
Assumed prior (non-accelerator) knowledge:
- C
GPU programming using standard language methods
Many GPU accelerators allow programming using parallelism constructs available as a standard part of programming languages. While this might not be as performant as using dedicated GPU languages or libraries, the barrier to entry is much lower. Lessons here should cover the basics of any parallel constructs unfamiliar to typical language users, before showing how they extend to running on GPU accelerators.
Builds on:
- Basic GPU architecture
Assumed prior (non-accelerator) knowledge:
- C, Fortran, or Python
OpenMP for GPU programming
OpenMP is a programming model for multithreaded parallel programming. Since version 4.0, it has supported offloading to accelerators, including GPUs. Lessons here should assume no prior knowledge of OpenMP, introducing it from scratch but focusing on the aspects necessary for GPU offload.
Builds on:
- Basic GPU architecture
Assumed prior (non-accelerator) knowledge:
- C or Fortran
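As a rough illustration of the kind of construct such lessons might introduce, the sketch below offloads a simple vector update using an OpenMP `target` region. The function name `saxpy_offload` and its parameters are hypothetical, chosen only for this example. The `map` clauses describe the host-to-device and device-to-host data transfers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from any library): computes y = a * x + y.
 * When compiled with OpenMP offload support, the loop may run on an
 * accelerator; otherwise the pragma is ignored and it runs on the host. */
void saxpy_offload(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);

    /* map(to: ...) copies x to the device before the loop.
     * map(tofrom: ...) copies y to the device and back afterwards. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Building with OpenMP enabled (for example `-fopenmp` with GCC or Clang) is needed for the pragma to take effect, and the additional flags that select a GPU target are compiler-specific. If no device is available, an OpenMP runtime will typically fall back to running the region on the host.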
Kokkos for performance-portable GPU programming
Kokkos is a performance-portable programming interface and library for C++, allowing the same program to target many programming models. This means that the same program can run on CPU and many different types of GPU accelerator, without needing to write specific additional instructions to support the GPU and manage communication with it.
Builds on:
- Basic GPU architecture
SYCL for performance-portable GPU programming
SYCL is a performance-portable programming standard for C++, maintained by the Khronos Group, allowing the same program to target many programming models. This means that the same program can run on CPU and many different types of GPU accelerator, without needing to write specific additional instructions to support the GPU and manage communication with it.
(The benefit of developing skills in this direction is currently limited: Intel currently do not have a compute-focused GPU accelerator on the market and have ended their contract with Codeplay, who developed SYCL support for non-Intel platforms, and SYCL support is not prioritised by non-Intel GPU accelerator manufacturers.)
Builds on:
- Basic GPU architecture
Assumed prior (non-accelerator) knowledge:
- C++
Alpaka for performance-portable GPU programming
Alpaka is a performance-portable programming interface and library for C++, allowing the same program to target many programming models. This means that the same program can run on CPU and many different types of GPU accelerator, without needing to write specific additional instructions to support the GPU and manage communication with it.
Builds on:
- Basic GPU architecture
Assumed prior (non-accelerator) knowledge:
- C++
Shared-memory multi-GPU programming with CUDA
When programming for a single GPU accelerator, the only data transfers that need to be considered are from the CPU memory to the accelerator’s, and back again. Once a program needs to use multiple accelerators, one must also consider transfers from one accelerator to another. NVIDIA have technologies, such as NVLink and peer-to-peer memory access, to make this much faster than transferring via the host; lessons here cover how to use CUDA to take advantage of this functionality.
Builds on:
- Introduction to CUDA
Multi-node GPU programming
When GPU-accelerated applications need to scale beyond a single node, it becomes increasingly challenging to maintain a high level of performance, particularly when data must be transferred from an accelerator on one node to another attached to a different node. Lessons here cover the technologies available to do this efficiently, including accelerator-aware MPI, and how it interacts with underlying communication libraries such as UCX and OpenFabrics.
Builds on:
- Shared-memory multi-GPU programming with CUDA
- Shared-memory multi-GPU programming with HIP
Assumed prior (non-accelerator) knowledge:
- MPI
Relevant training includes:
- GPU-aware MPI with ROCm: AMD blog article introducing GPU-aware MPI
NVIDIA device architecture
While a basic understanding of the structure of GPU accelerators and the CUDA language allows one to write accelerated code, a more detailed understanding of the underlying hardware enables more targeted optimisations to be made, which can have a significant effect on performance. Lessons in this area talk in more depth about the specifics of how NVIDIA GPU accelerators work internally, and how programs can take advantage of this.
Builds on:
- Introduction to CUDA
AMD device architecture
While a basic understanding of the structure of GPU accelerators and the HIP language allows one to write accelerated code, a more detailed understanding of the underlying hardware enables more targeted optimisations to be made, which can have a significant effect on performance. Lessons in this area talk in more depth about the specifics of how AMD GPU accelerators work internally, and how programs can take advantage of this.
Builds on:
- Introduction to HIP
Relevant training includes:
- AMD blogs on the SHAREing website: A number of deep-dive articles on specialised AMD accelerator topics.
Intel device architecture
While a basic understanding of the structure of GPU accelerators and the SYCL language allows one to write accelerated code, a more detailed understanding of the underlying hardware enables more targeted optimisations to be made, which can have a significant effect on performance. Lessons in this area talk in more depth about the specifics of how Intel GPU accelerators work internally, and how programs can take advantage of this.
Builds on:
- SYCL for performance-portable GPU programming
Shared-memory multi-GPU programming with HIP
When programming for a single GPU accelerator, the only data transfers that need to be considered are from the CPU memory to the accelerator’s, and back again. Once a program needs to use multiple accelerators, one must also consider transfers from one accelerator to another. AMD have technologies, such as Infinity Fabric links between accelerators, to make this much faster than transferring via the host; lessons here cover how to use HIP to take advantage of this functionality.
Builds on:
- Introduction to HIP