1 Introduction
Modelling the interaction of high-energy particles with matter is central to particle physics. High-energy particles (>100 MeV) are produced at the CERN Large Hadron Collider and interact with the detectors surrounding the points where protons collide. This modelling is performed by the Geant4 package, a C++ code developed over the last 20 years that is also used in applications such as nuclear physics, medical physics and space physics.
A GPU implementation of the code is a high priority, as it allows use of the latest HPC resources. The difficulty with re-engineering the code for GPUs is that as particles interact with matter they can create new particles that must also be propagated through the detectors, which leads to thread divergence.
A GPU implementation of the interactions is being developed by the Celeritas team, based at Oak Ridge National Laboratory in the US, in collaboration with colleagues at the University of Warwick. A recent review conducted at CERN demonstrated good physics performance for this implementation and an initial integration with the ATLAS experiment software.
In this mini-project, we used the existing celeritas v0.6.3 code to:
- Assess the performance of the code, using profilers to identify potential bottlenecks (see Section 2)
- Assess the impact of using single- versus double-precision floating point (see Section 3)
- Assess the ability to run the code on AMD GPUs as well as on NVIDIA CUDA GPUs (see Section 4)
1.1 Celeritas
Celeritas (Johnson et al. 2025) is a GPU-accelerated particle transport code for High Energy Physics detector simulation, implemented in C++ with CUDA for NVIDIA GPUs and ROCm for AMD GPUs (see Section 4 for further portability information).
It can be used as a library in other applications, as a standalone simulator (celer-sim), or as an integrated Geant4 application (celer-g4).
For a given simulation event, many independent particle tracks are concurrently simulated on the GPU within an iterative time-stepped algorithm. Within each time-step, each particle undergoes a number of simulated actions, depending on the simulation configuration.
In Celeritas, there are two implementations of model geometry (and navigation) available, which can be selected during configuration:
- VecGeom (Apostolakis et al. 2015) - a CUDA-only navigation library for Geant4 detector geometries, currently only compatible with NVIDIA GPUs.
- ORANGE (Oak Ridge Advanced Nested Geometry Engine) - an in-development surface-based geometry navigation library, implemented for NVIDIA and AMD devices.
Installation of Celeritas for non-development use is trivially achieved through Spack (with compilers and CUDA/ROCm known to Spack):

```
spack install celeritas
spack load celeritas
```

If installing for development (e.g. profiling), or if you wish to enable non-default options, a Spack environment description is provided (for CUDA systems) which installs all of the software dependencies before CMake is used to build Celeritas from source. These instructions can be found in the celeritas README.md.
1.2 Hardware & Software
Within this mini-project, a range of GPU systems available through the University of Sheffield were used, including local development workstations and UK Tier 2 HPC facilities.
| Name | CPU | Memory | GPU(s) | GPU target |
|---|---|---|---|---|
| Blackmass (TUoS) | 1x AMD Threadripper 2950X | 32 GB | 1x NVIDIA RTX 3080 | 86 |
| Mavericks (TUoS) | 1x Intel i7-6850K | 64 GB | 3x NVIDIA Titan V | 70 |
| N8 CIR Bede GH200 | 1x NVIDIA Grace | 480 GB | 1x NVIDIA H100 96GB | 90 |
| Awe (TUoS) | 1x AMD Threadripper 7960X | 64 GB | 2x AMD Radeon RX 7900 XTX | gfx1100 |
| JADE@ARC / JADE 2.5 | 2x AMD EPYC 9534 | 2000 GB | 8x AMD MI300X | gfx942 |
For consistency, the following software versions were used across all systems (except where mentioned otherwise):
- CUDA 12.6
- GCC 11
- ROCm 6.4.2
- Spack v1.1.0
- Celeritas v0.6.3
2 Profiling & benchmarking celeritas
To assess the performance of the celeritas code on a range of NVIDIA GPUs, a high-level benchmark across a range of models was performed, followed by finer-grained profiling of individual simulations using NVIDIA's profiling tools to highlight possible candidates for optimisation.
2.1 Hardware
In this section, the performance of celeritas was evaluated using the systems from Table 1 which contain NVIDIA GPUs:
- Blackmass - a local workstation containing 1 NVIDIA RTX 3080 GPU
- Mavericks - a local workstation containing 3 NVIDIA Titan V GPUs
- N8CIR Bede - a UK Tier-2 HPC facility which includes a pilot resource of 9 NVIDIA Grace Hopper superchips (GH200).
Local HPC facilities at the University of Sheffield containing NVIDIA A100, H100 PCIe and H100 NVL GPUs were not used, as kernel-level profiling with Nsight Compute is not currently possible on those systems.
To simplify the setups, and make benchmark comparisons fairer, only a single GPU was used in each system.
2.2 Building Celeritas
On each of the 3 systems used, Spack v1.1.0 was used to install the development dependencies of celeritas v0.6.3, using CUDA 12.6 and GCC 11 (one of the compiler combinations tested under continuous integration for celeritas).
A CMakeUserPresets.json file was created on each machine, defining optimised release builds using VecGeom and ORANGE for geometry, named ndebug and ndebug-novg respectively. Each machine used the most appropriate CMAKE_CUDA_ARCHITECTURES value for its GPU (70, 86 or 90 as appropriate).
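For illustration, a minimal CMakeUserPresets.json of this shape might look like the following. The preset and cache-variable values shown here are placeholders, not the exact presets used; the real presets inherit from those shipped with celeritas and set additional celeritas options (the CELERITAS_USE_VecGeom variable name is an assumption).

```json
{
  "version": 3,
  "configurePresets": [
    {
      "name": "ndebug",
      "displayName": "Optimised release build (VecGeom geometry)",
      "binaryDir": "${sourceDir}/build-ndebug",
      "cacheVariables": {
        "CMAKE_BUILD_TYPE": "Release",
        "CMAKE_CUDA_ARCHITECTURES": "90"
      }
    },
    {
      "name": "ndebug-novg",
      "inherits": "ndebug",
      "displayName": "Optimised release build (ORANGE geometry)",
      "binaryDir": "${sourceDir}/build-ndebug-novg",
      "cacheVariables": {
        "CELERITAS_USE_VecGeom": "OFF"
      }
    }
  ]
}
```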
Once created, the builds were performed using the scripts/build.sh build script on each system, for both build configurations. This script builds celeritas and executes the full test suite, which passed successfully on the RTX 3080 and Titan V machines. The test suite was not executed on the N8CIR Bede system as the node used for compilation was a CPU-only Grace-Grace node.
```
spack env activate celeritas
./scripts/build.sh ndebug
./scripts/build.sh ndebug-novg
```

2.3 Celeritas regression suite benchmarking on NVIDIA GPUs
The celeritas team have created the celeritas regression repository - a suite of test problems that ensures simulations run to completion, tracks how the input files and generated output for these problems change over time, and records how performance metrics change as the complexity of celeritas increases.
It is a useful tool to check that celeritas builds are behaving as intended, a source of models known to work in celeritas for performance analysis, and is structured to allow comparison of performance over time.
It includes 8 different input geometries, which are used with different input parameters to produce a set of 14 different test problems which have different performance profiles. Each problem is then executed using different binaries, executing on the CPU, the GPU, the GPU with additional synchronisation or a mix of CPU and GPU.
The regression repository includes tools for analysis and plotting of the results, although only a subset of the generated plots will be used in this report.
In order to run the regression suite on a given machine:
- The repository must be cloned, and input files downloaded through git-lfs. For this project, commit 5176e85d of celeritas-project/regression was used.
- The Python dependencies numpy and scipy must be installed and available on the path for suite execution, while matplotlib, numpy & pandas must be available for plotting. If using a Spack environment for the celeritas dependencies, these may need adding to your Spack environment.
- Celeritas builds must be produced, including celer-sim and celer-g4.
- run-problems.py must be modified to include the machine being used, extending the System class and including it in the list of systems.
- update-plots.py and summarize.py must be modified to include information about the system.
- Shell scripts to run the benchmark on your machine and rebuild the celeritas dependencies can be created.
The suite can then be executed by your run-<MACHINE>.sh script, or directly using run-problems.py.
Once complete, results can be plotted by executing the modified update-plots.py script which will produce a number of figures. For this project, update-plots has been modified to create the plots included in this report.
Figure 1 shows the per-task (per GPU in a multi-GPU system) simulation throughput for each regression problem, on each system benchmarked. The number of markers per problem varies, as not all problems are supported by both ORANGE and VecGeom geometry backends in celeritas and in some cases runtime errors were encountered on some machines:
- The CMS2018 geometry (\(C_{R2}M\) and \(C_{R2}F_{U}\)) is not currently supported by ORANGE so only VecGeom results are included
- The 10GB device memory of the RTX 3080 GPU was insufficient for some problems with the scales used in the regression suite, including the Atlas Tile Calorimeter (\(A_{T}M\)), CMS HGCAL (\(C_{HG}M\)) and CMS2018 geometry (\(C_{R2}M\) and \(C_{R2}F_{U}\)) problems.
- The 12GB device memory of the Titan V GPU was insufficient for the Atlas Tile Calorimeter (\(A_{T}M\)) and CMS HGCAL (\(C_{HG}M\)) problems when using ORANGE on the GPU.
The relative performance of each GPU model is roughly as one would expect - the consumer-grade RTX 3080 GPU with the lowest FP64 performance (0.47 TFLOPS, due to its 1:64 FP64:FP32 arithmetic unit ratio) and relatively low global memory bandwidth has the lowest simulation throughput. The Titan V GPU from 2017, with a 1:2 FP64:FP32 ratio and peak FP64 performance of ~7.4 TFLOPS, lies in the middle, while the newest GH200 GPU, with a 1:2 FP64:FP32 ratio, peak FP64 performance of ~34 TFLOPS and up to 4 TB/s of global device memory bandwidth, offers significantly higher throughput.
Figure 2 includes the simulation throughput for the CPU-based simulations using the celer-sim and celer-g4 binaries. This highlights the difference in age and scale of the CPUs: those in the development machines are relatively old with low core counts, compared to the 72-core ARM CPU in the N8 CIR Bede GH200 nodes. Additionally, the number of cores used had to be reduced to ensure that some tests did not hit out-of-memory errors. Although we are mostly interested in GPU performance, knowing the CPU performance for the same workloads allows us to understand the relative improvement for a given system. These comparisons are not ideal, however, as the local development machines are all biased towards the GPUs (they are GPU development machines).
By comparing the CPU and GPU builds of celer-sim, the regression suite post-processing scripts extract the relative performance improvement (speedup) of the GPU build over the CPU build:
- The N8 CIR Bede GH200 system showed a GPU vs CPU performance improvement of between 0.4× and 3×, comparing the full 72 core CPU against the single GPU.
- The NVIDIA Titan V GPU in Mavericks demonstrated speedups between 3.0× & 19×, comparing a single GPU to the full 6-core CPU.
- The NVIDIA RTX 3080 system Blackmass showed relative performance improvements of between 1.4× and 4×, compared to the full 6-core CPU.
2.4 Profiling
From the wide range of simulations carried out in Section 2.3, model inputs known to run correctly can be isolated for individual profiling, to gain insight into where the application spends its runtime and to find candidates for optimisation.
There are many tools and techniques available for profiling applications. For evaluating GPU-accelerated code on NVIDIA GPUs, the Nsight family of tools provided by NVIDIA is commonly used. Two of its tools target compute-related profiling (rather than 3D computer graphics): Nsight Systems and Nsight Compute.
- Nsight Systems captures system-wide application traces, allowing CPU and GPU activity to be visualised as the application progresses. This can identify where GPUs are or are not being utilised, and identify which GPU kernels are the most time-consuming for a given simulation run and are likely candidates for optimisation.
- Nsight Compute captures fine-grained profiling information about individual GPU kernels, including line-level profiling (if the appropriate compilation flags are used). This can show why the specific kernel took the time it did, and allows the user to identify how the kernel may be improved.
Other vendors provide similar tooling for profiling their own GPUs; for example, AMD provides tools such as rocprof, the ROCm Systems Profiler and the ROCm Compute Profiler. These tools are generally less mature than the NVIDIA tooling but provide similar functionality.
Celeritas also includes code to support and improve profiling activities. When the CELER_ENABLE_PROFILING environment variable is set (and assuming required libraries were available during compilation) NVTX or ROC-TX markers and ranges are created during the application run. These events and ranges significantly improve the profiling experience, by simplifying the use of profiling tools and enhancing the visualisation of application timelines with custom regions. For CPU-only builds perfetto events are embedded.
In this sub-section, profiling of some of the regression test cases is shown and described, showing the use of Nsight Systems traces for full simulations and then the use of Nsight Compute for kernel-level profiling.
2.4.1 Nsight Systems Application Timelines
Using the Nsight Systems command line tool (nsys), the celer-sim binary was traced on each system for several input geometries. The celer-sim JSON input files were extracted from those used in Section 2.3, which allow a maximum of 1048576 concurrent tracks (i.e. concurrent threads), with up to 16 events and up to 32768 steps per event. The number of steps could have been reduced to simplify profiling and reduce the size of profile traces.
```
CELER_ENABLE_PROFILING=1 \
nsys profile \
    --trace=cuda,nvtx,osrt \
    --osrt-backtrace-stack-size=16384 \
    --backtrace=fp \
    --force-overwrite true \
    -o timeline.nsys-rep \
    celer-sim input.json
```

Once complete, the generated .nsys-rep file was copied from the benchmark machine for local visualisation through nsys-ui. Figure 3 shows the Nsight Systems timeline for the \(T_{EM3}^{+}F_{U}M\) problem when executed on the GH200 GPU in the N8 CIR Bede system, with ORANGE for geometry processing. In this timeline view, the traced execution moves from left to right, with different vertical regions showing each traced view, including CPU activity from the OSRT tracing, the CUDA GPU activity, and the NVTX ranges inserted by celeritas as a hierarchy of grey boxes. As celeritas is a time-stepped simulation, this screenshot of the fully zoomed-out timeline for the full 1766-step simulation hides a lot of detail, although the initialisation and input-file-loading phase of the application can be seen occupying roughly 1/12 of the trace. As this only occurs once per run, and is predominantly on the CPU, we can dismiss this region for now.
Figure 4 shows the timeline for the same simulation, zoomed in to just the first 10 simulation timesteps. The repeating step structure can be seen both in the groups of repeated kernel launches in the CUDA HW region and in the NVTX regions. The initial steps within celer-sim have shorter durations than later steps, as either there are fewer active tracks being simulated and/or the tracks are in a simpler area of the geometry. This makes the very early steps less interesting from a profiling perspective than those in the middle of the simulation, where the per-step duration is higher and more consistent. A similar phase of shorter-than-average steps is observed at the end of the simulation, as not all tracks require the same number of simulation steps to resolve.
Figure 5 shows a single typical timestep for this simulation, which involves 23 kernel launches for this problem. The number of kernel launches per step varies between problems, depending on which actions are required for the input problem, which means the most time-consuming parts of a simulation can vary between input problems.
For this particular step (for this model on this GPU), a single kernel launch dominates the step. Unfortunately, due to the use of C++ templating, the kernel name is very long and difficult to read in the screenshots (but can be viewed in the UI). The NVTX regions near the bottom of the screen are offset due to the asynchronous nature of CUDA kernel launches; however, if the per-stream regions are expanded, a correctly aligned version is presented. In this case, the propagate-uniform NVTX range is responsible for the kernel launch, and would be the most important kernel for further analysis as it represents the largest portion of execution time. The Nsight Systems UI also provides this information in the sorted list of kernels, with the top-most kernel in the list corresponding to the kernel with the largest total duration. The next most interesting kernels to evaluate in this model belong to the scatter-msc-urban, pre-step & along-step-neutral NVTX ranges. The full C++ templated kernel name for the propagate-uniform kernel in this case is:
```
void celeritas::detail::<unnamed>::launch_action_impl<celeritas::ConditionalTrackExecutor<celeritas::detail::IsAlongStepUniformField, celeritas::detail::PropagationApplier<celeritas::detail::UniformFieldPropagatorFactory, void>>, celeritas::detail::PropagationApplier<celeritas::detail::UniformFieldPropagatorFactory, void>, (bool)1>(celeritas::Range<celeritas::OpaqueId<celeritas::Thread_, unsigned int>>, T1)
```

This process of using the Nsight Systems timeline to identify where the time is spent for a simulation is then repeated across multiple problems, for multiple celeritas builds, across multiple systems.
For the same \(T_{EM3}^{+}F_{U}M\) problem on the GH200 GPU but using VecGeom rather than ORANGE, the application timeline has a very similar structure, but with different kernel durations. propagate-uniform is once again the most time-consuming kernel and of the most interest; however, the order of the next-most interesting kernels is different, with scatter-msc-urban, limit-step-msc-urban & pre-step being the next most time-consuming kernels overall.
For the \(T_{EM3}^{-}F_{U}\) regression problem using ORANGE for geometry, the propagate-uniform process is once again the most time-consuming by a large margin for the GH200 GPU. This is followed by pre-step and along-step-neutral. Figure 6 shows scaled timelines for the 3 NVIDIA systems profiled, containing RTX 3080, Titan V & GH200 GPUs. This shows how this single time-consuming simulation step performs across the systems, with the relative order of performance lining up with the wider benchmark results, and the same kernel being dominant across all 3, including the consumer RTX 3080 GPU which has significantly reduced double-precision compute performance.
2.4.2 Nsight Compute kernel profiling
From the use of Nsight Systems, we know that the propagate-uniform kernel accounts for the majority of the GPU runtime for the test cases (\(T_{EM3}^{+}F_{U}M\) & \(T_{EM3}^{-}F_{U}\)) on the 3 NVIDIA devices used (GH200, Titan V, RTX 3080). To find ways to improve this kernel, we gather finer-grained information about how it uses the GPU hardware through kernel-level and line-level profiling. For NVIDIA GPUs (from the Volta architecture onwards) the appropriate tool for this is Nsight Compute.
Nsight Compute can be used interactively, or in a batch mode on the command line, capturing metrics and generating profile files for offline analysis. For this benchmarking, as remote GPUs were in use, batch-mode profiling was used, capturing the full set of performance metrics to ensure as much information was available as possible. Capturing the full set of metrics can take a significant amount of time, as the application is replayed many times. As we know that we are interested in a kernel within an NVTX region, we can use the --nvtx flag with the --nvtx-include <arg> argument to capture only the kernels launched within the relevant NVTX region, and use --launch-skip <arg> & --launch-count <arg> to target specific instances of kernels rather than profiling all of them. Additionally, when compiling celeritas, -lineinfo was added to CMAKE_CUDA_FLAGS to ensure that line-level profiling information was available. The command(s) used to profile celeritas were of the following form:
```
CELER_ENABLE_PROFILING=1 \
ncu --set=full \
    --nvtx --nvtx-include celeritas@celer-sim/step/*/propagate-uniform \
    --launch-skip 300 --launch-count 10 \
    --force-overwrite \
    -o metrics.ncu-rep \
    celer-sim input.json
```

Once complete, the .ncu-rep files were copied to a local machine with Nsight Compute installed for interactive analysis.
Figure 7 shows the Nsight Compute UI when a report is first opened, in this case for the propagate-uniform capture of the \(T_{EM3}^{+}F_{U}M\) test problem executed on an NVIDIA GH200 GPU. This includes a table of all of the individual kernel profiles in the report, with an overview and optimisation suggestions.
Figure 7: propagate-uniform kernel for the \(T_{EM3}^{+}F_{U}M\) problem, when executed on the N8 CIR Bede GH200 GPU
After selecting the specific kernel instance to examine in more detail (typically the longest-running), the details tab is displayed. This tab contains many sections, each of which can be expanded and contains information about this kernel invocation. Figure 8 shows this view for the same longest-duration kernel from Figure 7. The following subsections discuss specific report sections in more detail.
Figure 8: propagate-uniform kernel for the \(T_{EM3}^{+}F_{U}M\) problem, when executed on the N8 CIR Bede GH200 GPU
With -lineinfo enabled, line-level metrics can be viewed on the Source tab, although for remote profiles a mapping to local copies of the source files is required. Figure 9 shows the Source tab with Algorithms.hh selected; this part of the UI does not lend itself to screenshots.
Figure 9: propagate-uniform kernel for the \(T_{EM3}^{+}F_{U}M\) problem, when executed on the N8 CIR Bede GH200 GPU
The following subsections each discuss potential areas for performance improvement for the propagate-uniform kernel based on the profiler output, although more detail is available than can be included here.
2.4.3 Kernel analysis - workload imbalance
Figure 10 shows the GPU Throughput plot from the GPU Speed of Light section of the Nsight Compute report for 2 instances of the propagate-uniform kernel executed on an NVIDIA GH200 GPU for the \(T_{EM3}^{+}F_{U}M\) inputs using ORANGE. Figure 10 (a) shows information for the kernel from the 411th timestep, which executed in 2.93 ms, and Figure 10 (b) for the kernel from the 451st timestep, which took 9.83 ms.
We can see that both of these kernels are latency bound, or the device is under-utilised, as both compute and memory throughput are low. As the kernel launches ~ 1 million threads, this should be sufficient to fully occupy any current NVIDIA GPU, however there may not be that many active tracks at any given time. The longer-running of the two kernels (step 451) shows reduced throughput for both metrics, which suggests workload imbalance may be a factor.
Figure 10: propagate-uniform kernel for the \(T_{EM3}^{+}F_{U}M\) test case on GH200 using ORANGE
Figure 11 shows the PM sampling section of the Nsight Compute output for the same 2 kernel invocations from steps 411 (Figure 11 (a)) and 451 (Figure 11 (b)). This section shows various performance metrics which are sampled over the duration of the kernel execution, which can illustrate workload imbalance. Note that the scale of the x axis in these plots is different due to the varying kernel runtime.
For both kernel invocations, the SMs are active for the majority of the first ~2.25 ms of the kernel, before the number of active SMs decreases. For the longer-running kernel, the remaining 7 ms has a low volume of SM active cycles. This suggests workload imbalance, with a small number of tracks (threads) taking more time to complete the propagation algorithm, perhaps because they lie in a more complex region of the detector geometry. In some cases this type of workload imbalance can be reduced, although not always; it may be worth further investigation. Alternatively, it may be possible to hide the imbalance through multiple asynchronous kernel launches, using streams to keep the GPU busy.
Figure 11: propagate-uniform kernel for the \(T_{EM3}^{+}F_{U}M\) test case on GH200 using ORANGE
2.4.4 Kernel analysis - latency and occupancy
From Figure 10 we know that the propagate-uniform kernel is latency bound (for the specific simulation and GPU combination).
GPUs can hide sources of latency through hardware context-switching. Ideally there will be more resident warps (groups of 32 threads on current NVIDIA GPUs) than each streaming multiprocessor can execute concurrently. When a warp stalls, for example waiting on a memory access, a context-switch to a different resident warp occurs, allowing the latency to be hidden.
Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps. Higher occupancy allows more latency to be hidden (although a low-occupancy kernel is not guaranteed to be slower than a high-occupancy kernel). The theoretical occupancy of a kernel can be limited by many factors, including the number of registers required per thread, the block size and the use of shared memory per block. Nsight Compute includes a section showing and visualising information related to occupancy, as shown in Figure 12 for the propagate-uniform kernel in step 451.
Figure 12: propagate-uniform kernel from step 451 for the \(T_{EM3}^{+}F_{U}M\) test case on GH200 using ORANGE
For double-precision codes with complex kernels such as celeritas, the number of registers required per thread is often very high; in this case the ORANGE version of this kernel requires 208 of a maximum 254 registers per thread. The equivalent kernel using VecGeom uses 254 registers per thread. This results in the worst-case 12.5% theoretical occupancy, which prevents the GPU from hiding latencies.
Reducing the number of registers per thread may allow for an increase in performance by hiding these latencies; however, this can be challenging. Register usage is determined by the PTX assembler and optimiser, so source changes which you might expect to reduce register counts may have no impact. Smaller, less-complex kernels, using 32-bit rather than 64-bit data types (which occupy more of the available register space), or enabling Link-Time Optimisation (LTO) may lead to reductions in the number of registers per thread.
With CUDA, it is also possible to limit the maximum number of registers that a kernel will be allocated by the PTX optimiser, using the -maxrregcount compiler flag, the __launch_bounds__ attribute or the __maxnreg__ function qualifier.
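As a sketch of the attribute form (the kernel body is a placeholder, not Celeritas code), __launch_bounds__ declares the maximum block size and, optionally, a minimum number of resident blocks per SM, which caps the registers the optimiser may allocate per thread:

```cuda
// Placeholder kernel, not Celeritas code:
// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor) lets the
// PTX optimiser cap register usage so the requested residency is achievable.
__global__ void __launch_bounds__(256, 2) propagate_sketch(double* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] += 1.0;  // stand-in for the propagation work
    }
}
```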
However, limiting the maximum numbers of registers per thread often does not result in a performance improvement, as the PTX optimiser may be forced to spill registers to local memory (a specialised region of global memory), which has a much higher latency than the register file, typically negating the benefits of higher occupancy.
Additionally, this section of the Nsight Compute report also highlights the workload imbalance of this kernel, indicated by the large difference between the 12.5% theoretical occupancy and 5.89% achieved occupancy.
2.4.5 Kernel analysis - stalls
If the occupancy cannot be improved to hide latencies, then the sources of latency need to be understood and reduced.
The “Warp State Statistics” section of the Nsight Compute report shows the average number of cycles each warp is stalled, for different categories of stall. The larger the number, the more performance might be gained by addressing the source of the stalls.
Figure 13 shows the diagram of warp state stalls for the propagate-uniform kernel from step 451 for the \(T_{EM3}^{+}F_{U}M\) test case on GH200 using ORANGE. The most common category is Stall Wait: fixed-latency execution dependencies, which may simply be caused by the use of instructions with higher latencies. This could potentially be improved through --use-fast-math, although this may impact numerical correctness.
The second most-common source of stall events are long scoreboard stalls, which are memory-dependencies on the L1TEX operations (from local, global, surface or texture memory).
Any reduction in the average stall time would improve performance, especially for low-occupancy kernels. The sources of each type of stall can be found through the line-level profiling view, where the stall samples can be shown next to each line of SASS or PTX, and the corresponding lines of CUDA C++ source code.
Figure 13: propagate-uniform kernel from step 451 for the \(T_{EM3}^{+}F_{U}M\) test case on GH200 using ORANGE
2.4.6 Kernel analysis - RTX 3080 FP64
From the regression benchmarking runs, we know that the RTX 3080 GPU, which has lower double-precision compute performance, is slower for FP64 builds of celeritas than the other GPU models. Profiling can be used to confirm our assumptions about the cause of such performance differences.
In this case, by comparing profile information between GPUs (using Nsight Compute's baseline functionality), we can see this difference directly. Figure 14 shows the utilisation of different GPU hardware pipelines for the propagate-uniform kernel for the \(T_{EM3}^{-}F_{U}M\) test case using ORANGE on RTX 3080 (blue), Titan V (green) and GH200 (purple) GPUs. By overlaying the profile information for the same kernel invocation, we can see the difference in use of the available hardware. It is clear that for the RTX 3080 a larger fraction of the available FP64 hardware is in use throughout this kernel.
Figure 14: propagate-uniform kernel for the \(T_{EM3}^{-}F_{U}M\) test case using ORANGE on RTX 3080 (blue), Titan V (green) and GH200 (purple) GPUs
3 Floating point precision
The floating point formats used within GPU accelerated applications can have a significant impact on GPU performance, but depending on the application may also have a significant impact on correctness.
Traditionally, GPUs were designed for use with single-precision (32-bit) floating point numbers, as this is sufficient for typical graphics workloads, so most floating-point units on modern GPUs are single-precision units.
Current GPUs from the main vendors also include hardware units for double-precision (64-bit) floating point operations, but very often at a much reduced capacity compared to single-precision units. There is typically a 1:32 or 1:64 ratio of FP64 to FP32 units for consumer devices, and 1:2 for data-centre products designed for HPC usage (although these numbers vary between manufacturers and individual products).
More recently, for modern AI applications there has been a shift to include support for narrower and narrower floating point units, with NVIDIA Blackwell GPUs including hardware support for 4-bit floating point matrix and vector operations.
In HEP applications it is standard to use double-precision floating point. In this section we investigate the impact of running celeritas 0.6.3 using single-precision floating point numbers.
3.1 Configuring floating point precision in celeritas
Celeritas includes a CMake option CELERITAS_REAL_TYPE to control the floating point precision used for real numbers, documented as an experimental feature:
> Choose between double and float real numbers across the codebase. This is currently experimental.
When configured to float, via -DCELERITAS_REAL_TYPE=float, this controls a C++ type alias, so that real_type can be used throughout the celeritas code-base, corresponding to either double or float depending upon the build configuration.
This is the only change required to build Celeritas in single precision if using the ORANGE geometry backend. If using VecGeom, the VecGeom build would also need to be a single-precision build, which Celeritas checks for during CMake configuration; if this is not the case a configuration error is raised. For this work, we have only investigated the ORANGE implementation (due to linker issues encountered when attempting to link VecGeom, although these may have been a result of user error).
For this work, all builds were optimised release builds of celeritas v0.6.3 using CUDA 12.6 & GCC 11.
3.2 Single-precision compilation and test suite execution
As in Section 2, the 3 NVIDIA GPU systems listed in Table 1 were used to investigate FP32 Celeritas builds, using CUDA 12.6 and GCC 11. The Celeritas scripts/build.sh script runs the test suite once the build completes; however, some tests which are known to fail in single precision are disabled when configuring with CELERITAS_REAL_TYPE=float (e.g. app/celer-g4, which is commented as disabled because propagation fails in single precision). As this is an experimental feature, this is not surprising.
- On the RTX 3080 and Titan V systems compilation was successful and all tests passed.
- On the GH200 N8CIR Bede system, compilation succeeded; however, the test suite was not executed due to the absence of a GPU in the Grace-Grace login node used for compilation.
- In addition to the CUDA 12.6 builds, CUDA 13.0 was also tested on the RTX 3080, which compiled successfully and tests passed.
In Section 4 the use of AMD GPUs with Celeritas is discussed. An attempt to compile Celeritas in single precision for AMD using ROCm 6.4.2 was made; however, this was unsuccessful due to compilation errors related to __clang_hip_math.h functions. This may be an issue with specific ROCm versions, or simply a configuration which has not been tested recently (as this functionality is experimental).
3.3 Single-precision regression suite benchmarking
To check if celeritas would run real-world examples when compiled in single-precision, the celeritas regression suite (see Section 2.3) was executed using the FP32 builds, during which many issues were encountered.
Figure 15 shows the single-GPU regression suite simulation throughput for the original FP64 builds and FP32 builds on the RTX 3080 and Titan V systems. Only 4 simulations split across 3 models completed within the timeout and produced a converged result. For the Titan V, which has a 1:2 FP64 to FP32 ratio, simulation throughput decreased. For the RTX 3080, which has a 1:64 FP64 to FP32 ratio, simulation performance increased for the 2 models which ran to completion. The GH200 GPU from the N8CIR Bede system is excluded, as no runs completed and converged within the timeout window.
Models which were not executed or failed to complete had several different causes:
- The CMS2018 geometry (\(C_{R2}M\) and \(C_{R2}F_{U}\)) is not currently supported by ORANGE so could not be executed in the ORANGE-only FP32 builds.
- Several of the input files are converted from in-memory Geant4 models, which is not implemented for single-precision real values in Celeritas 0.6.3.
- Several \(T_{EM3}^-\) configurations encountered "cudaErrorIllegalAddress: an illegal memory access was encountered" errors reported within thrust::copy_if, although the error may originate from a prior asynchronous kernel launch.
- The regression benchmark suite is configured with a timeout of 600 s for any individual simulation, which some CPU and GPU simulations hit.
- Some CPU simulations raised "track failed to cross local" errors.
Figure 16 includes the CPU-based simulations for the FP64 and FP32 builds above, showing that the CPU simulations (celer-sim executing on the CPU) also encountered issues for many problem sets.
3.3.1 Profiling in FP32
When attempting to generate profile traces for the functional models in FP32, models would either hang indefinitely or encounter CUDA errors, limiting the amount of profiling information which could be extracted to evaluate whether FP32 would lead to a definite performance improvement if the issues can be resolved.
When limiting the simulation to only run up to 10 steps, very short Nsight Systems timelines could be produced, although as the number of concurrent tracks is low, the runtime of the included kernels is short. For the propagate-uniform kernel, previously shown to be the most time-consuming and therefore the most interesting for optimisation, the FP32 build uses 128 registers per thread on the RTX 3080 system, the same as in FP64 builds, suggesting there is no improvement to occupancy for this kernel when targeting Compute Capability 8.6 GPUs. However, register use is reduced for other kernels, such as scatter-msc-urban, which drops from 64 registers per thread to 48 registers per thread, increasing theoretical occupancy from 66.67% to 83.33%. This shows that one of the potential performance-impacting factors is improved for some kernels, which is one of the expected benefits of moving to reduced precision; however, register usage is determined by the PTX assembler, which can be difficult to influence directly.
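The occupancy figures above can be reproduced with a standard back-of-envelope calculation. The device limits used below (65,536 registers per SM, a 48-resident-warp maximum, and a warp allocation granularity of 4) are the published figures for Compute Capability 8.6, taken as assumptions rather than measured here:

```cpp
// Back-of-envelope theoretical occupancy limited by register use, using the
// Compute Capability 8.6 limits (RTX 3080): 65536 registers/SM, at most 48
// resident warps/SM, warps allocated in groups of 4. This ignores
// shared-memory and block-size limits, so it is an upper bound and not a
// replacement for Nsight Compute's occupancy analysis.
constexpr int regs_per_sm = 65536;
constexpr int max_warps_per_sm = 48;
constexpr int warp_alloc_granularity = 4;
constexpr int threads_per_warp = 32;

constexpr int active_warps(int regs_per_thread)
{
    int limit = regs_per_sm / (regs_per_thread * threads_per_warp);
    if (limit > max_warps_per_sm)
        limit = max_warps_per_sm;
    // Warps are allocated in groups, so round down to the granularity
    return (limit / warp_alloc_granularity) * warp_alloc_granularity;
}

constexpr double occupancy_pct(int regs_per_thread)
{
    return 100.0 * active_warps(regs_per_thread) / max_warps_per_sm;
}
```

Under these assumptions, 64 registers per thread permits 32 of 48 warps (66.67%), while 48 registers permits 40 of 48 (83.33%), matching the values reported by the profiler for scatter-msc-urban.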
For GPUs with fewer FP64 units, the switch to FP32 would have a larger impact, but more work is required before this can be verified for the experimental use of 32-bit precision real numbers in Celeritas.
3.4 Conclusion
As of Celeritas v0.6.3, single precision support is experimental. When using ORANGE in single precision, the parts of the test suite which have not been excluded due to known issues run as expected, while full simulations are more likely than not to encounter errors. There are potential performance and hardware cost savings in this area if these issues can be resolved. The use of single precision with VecGeom-based builds was not checked.
4 Performance portability
NVIDIA GPUs are currently the dominant GPUs in use within HPC clusters, as reflected in the Top 500 lists. Figure 17 shows the number of systems using each vendor's accelerators in the Nov 2025 list. In part, this dominance is due to the CUDA software and hardware ecosystem, which is mature and stable, with thorough documentation and powerful developer tools such as Nsight Systems and Nsight Compute.
Other vendors such as AMD also produce high performance GPUs suitable for HEP simulations which are deployed in large-scale HPC clusters. However, the software ecosystem around these devices is typically less mature and historically support for consumer AMD GPUs has been very poor.
Intel's Ponte-Vecchio GPUs were adopted by several systems. Adoption of Intel GPUs for scientific compute may have been limited by the repeated cancellations of successor architectures to Ponte-Vecchio.
As a result of NVIDIA's dominance and investment, a large number of GPU codes only target NVIDIA GPUs, but research facilities may wish to procure non-NVIDIA devices, requiring code to be portable across vendors.
4.1 GPGPU programming approaches
When implementing a GPU accelerated code, there are many approaches which can be taken including: low-level vendor-specific approaches; the use of a performance-portable library or toolchain; or higher-level abstractions such as directive based programming. The choice of approach will impact the peak performance which can be achieved, the ease of development and the portability of the code.
Low-level vendor-specific APIs such as CUDA (for NVIDIA GPUs) and ROCm/HIP (for AMD GPUs) provide access to functionality specific to a vendor's hardware and are well supported by the vendor with high-quality tooling. CUDA is the dominant GPGPU programming approach in use. Originally launched in 2007, CUDA has an extensive set of libraries available, high-quality developer tools, and a (relatively) large community of users. CUDA is officially supported by Nvidia across the full range of GPUs, from low-cost consumer devices through to the much more costly data centre products (typically used in research HPC facilities). However, it is proprietary and can only be executed on Nvidia GPUs.
HIP/ROCm is AMD’s open-source equivalent to CUDA, which follows a very similar programming model to CUDA, with most (but not all) CUDA API methods having a direct HIP/ROCm equivalent. As such, it is quite natural for CUDA and HIP implementations to be very similar, or use a small abstraction layer to handle both with the same source code (while allowing the use of platform specific functionality when required).
Performance-portability abstraction libraries and toolchains allow write-once, run-anywhere programming for multiple vendors' GPUs, and many support non-GPU targets as well. SYCL is an open-standard C++ programming model for heterogeneous computing (supporting CPUs, GPUs, FPGAs and other accelerators), managed by the Khronos Group. Individual SYCL implementations may support a single target or multiple targets. Using a modern C++ dialect, users provide lambda functions which are then executed by an appropriate acceleration backend for the device used at runtime.
Libraries such as alpaka and Kokkos are platform-independent libraries which abstract away vendor-specific implementation details, allowing users to write modern C++ which can be executed on a range of devices. For an example of a High Energy Physics code using alpaka, see traccc - demonstrator tracking chain for accelerators.
Directive-based approaches to parallelisation such as OpenMP and OpenACC, which use directives to instruct compilers to generate parallel code from existing serial applications, can also be used for GPU offload in more recent versions. However, as a high-level abstraction, not all problems are suited to GPU acceleration via directives.
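As a flavour of the directive-based style, a loop can be offloaded with a single pragma; the same source compiles as plain serial C++ when OpenMP offload is not enabled. This is a generic sketch, unrelated to the Celeritas code base:

```cpp
#include <cstddef>
#include <vector>

// Generic sketch of directive-based GPU offload (not Celeritas code). With an
// offload-capable OpenMP compiler this loop runs on the device, with the map
// clauses controlling host<->device data movement; without OpenMP the pragma
// is ignored and the loop runs serially on the CPU. This graceful fallback is
// the main appeal of the approach for incremental porting.
inline void saxpy(float a, std::vector<float> const& x, std::vector<float>& y)
{
    std::size_t n = x.size();
    float const* xp = x.data();
    float* yp = y.data();
    #pragma omp target teams distribute parallel for map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i)
        yp[i] = a * xp[i] + yp[i];
}
```

An embarrassingly parallel loop like this maps well to directives; the divergent, track-spawning workloads discussed in Section 1 are considerably harder to express this way.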
4.2 Celeritas CUDA/HIP implementation
Celeritas is implemented in C++ with CUDA for NVIDIA GPUs and HIP/ROCm for AMD GPUs.
When configuring and building Celeritas, CMake options control if CUDA or HIP/ROCm are enabled (CELERITAS_USE_CUDA, CELERITAS_USE_HIP), and CMake will then check for compiler support. Only one of CUDA or ROCm can be enabled for a single build - they are mutually exclusive.
During compilation, C preprocessor definitions can be used to guard out implementation-specific code differences; for example, CELERITAS_USE_CUDA is only defined when CUDA is enabled.
The main abstraction across CUDA and HIP/ROCm in Celeritas is the CELER_DEVICE_API_CALL macro, defined in src/corecel/Assert.hh. When passed a statement to execute, the macro prefixes the provided statement with the appropriate library prefix for the current build. For example:
CELER_DEVICE_API_CALL(DeviceSynchronize()); results in a call to either cudaDeviceSynchronize(); (for CUDA-enabled builds) or hipDeviceSynchronize(); (for HIP/ROCm builds). The macro also handles the error checking of the underlying API call, which is common practice in CUDA/HIP applications.
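The prefixing behaviour described above can be reproduced with standard preprocessor token pasting. The sketch below is a hypothetical simplification, not the actual Assert.hh source: the vendor API is mocked with a plain function (and CELERITAS_USE_CUDA is defined by hand) so it compiles without either toolkit.

```cpp
#include <stdexcept>

// Mock of the vendor API so this sketch is self-contained: 0 plays the role
// of cudaSuccess / hipSuccess.
inline int cudaDeviceSynchronize() { return 0; }

// Defined by hand here for illustration; in a real build CMake defines this
// only when CUDA is enabled.
#define CELERITAS_USE_CUDA 1

#if CELERITAS_USE_CUDA
#    define CELER_DEVICE_PREFIX(stmt) cuda##stmt
#else
#    define CELER_DEVICE_PREFIX(stmt) hip##stmt
#endif

// Paste the vendor prefix onto the statement and check the returned error
// code, in the spirit of the real macro's behaviour
#define CELER_DEVICE_API_CALL(stmt)                         \
    do                                                      \
    {                                                       \
        if (CELER_DEVICE_PREFIX(stmt) != 0)                 \
            throw std::runtime_error("device API failure"); \
    } while (0)

inline bool synchronize_ok()
{
    // Expands to cudaDeviceSynchronize() in this CUDA-flavoured sketch
    CELER_DEVICE_API_CALL(DeviceSynchronize());
    return true;
}
```

The ## operator pastes the prefix onto the first token of the argument, so one call site serves both backends while keeping the error check in a single place.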
Although this macro-based approach works for the vast majority of CUDA/HIP API calls, there are still several occurrences within Celeritas where implementation-specific differences are handled, including but not limited to:
- cudaHostAlloc vs hipHostMalloc for allocating pinned host memory, and cudaFreeHost vs hipHostFree for freeing it
- __launch_bounds__'s second argument has a different meaning (minimum blocks vs minimum warps)
- Slight differences in texture memory handling
- Different CUDA kernel launch syntax
- Differences in CUDA vs HIP libraries, such as curand and rocrand or NVTX and ROCTX
- Some workarounds for changes in older versions of HIP, or handling of misbehaving external CUDA libraries
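Differences like the pinned-memory pair cannot be bridged by a prefix macro alone, since the suffixes differ (HostAlloc vs HostMalloc), so a thin wrapper per operation is the usual pattern. The sketch below is illustrative only (it is not how Celeritas itself resolves this): the vendor branches are shown as comments, with std::malloc standing in so the sketch compiles without a GPU toolkit.

```cpp
#include <cstddef>
#include <cstdlib>

// Sketch of a per-operation wrapper hiding the cudaHostAlloc / hipHostMalloc
// naming difference (illustrative only). A real build would dispatch on the
// CELERITAS_USE_CUDA / CELERITAS_USE_HIP guards:
//   return cudaHostAlloc(ptr, size, cudaHostAllocDefault);  // CUDA
//   return hipHostMalloc(ptr, size);                        // HIP
// Here std::malloc stands in so the sketch compiles without either toolkit.
inline int pinned_host_alloc(void** ptr, std::size_t size)
{
    *ptr = std::malloc(size);
    return *ptr != nullptr ? 0 : 1;  // 0 mirrors cudaSuccess / hipSuccess
}

inline void pinned_host_free(void* ptr)
{
    // cudaFreeHost(ptr) / hipHostFree(ptr) in a real build
    std::free(ptr);
}
```

Concentrating each mismatched pair behind one function keeps the conditional compilation in a single translation unit rather than scattered across call sites.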
HIP/ROCm does not fully replicate the CUDA API, with some of the more recent or less commonly used parts of the CUDA API absent. This may prevent or delay the adoption of performance-enhancing features while these differences between implementations remain. For example, the CUDA Graph API has been extended in recent CUDA releases, but these features have not been fully implemented in HIP/ROCm.
4.2.1 VecGeom
Celeritas includes 2 options for geometry and navigation on-device - ORANGE and VecGeom (originally discussed in Section 1.1).
- ORANGE, which is included within celeritas, is implemented for CUDA and HIP, so can be used on NVIDIA and AMD devices.
- VecGeom is currently only implemented in CUDA.
This means that if using Celeritas on AMD devices, only ORANGE can be used for geometry and navigation.
However, the VecGeom development team plan to “Remove explicit CUDA dependencies and implement a portability layer (e.g. Alpaka)” (Ribon 2025), which would enable the use of VecGeom on AMD devices.
4.2.2 Other GPU vendor support
Plans to extend celeritas with support for Intel GPUs via SYCL were cancelled due to the lack of upcoming HPC-class Intel GPUs since the repeated cancellation of post-Ponte-Vecchio architectures.
4.3 Celeritas regression suite benchmarking on AMD GPUs
To compare performance on Nvidia and AMD devices, the celeritas regression suite (see Section 2.3) was executed on 2 AMD systems listed in Table 1 and compared to 2 of the previously benchmarked Nvidia systems from Section 2.3:
- Awe - a local workstation containing 2 AMD Radeon 7900XTX GPUs
- JADE@ARC / JADE 2.5 - a UK Tier-2 technology pilot resource containing 3 nodes of 8 MI300X GPUs per node
- Blackmass - a local workstation containing 1 NVIDIA RTX 3080 GPU
- N8 CIR Bede - a UK Tier-2 HPC facility which includes a pilot resource of 9 NVIDIA Grace Hopper superchips (GH200).
Optimised double-precision builds of celeritas were compiled using ROCm 6.4.2, and the benchmarks were executed using a single GPU and appropriate fraction of the host system.
Cross-vendor comparisons of hardware will always include a degree of bias, as the manufacturers operate on different product cycles to one another. The fairest comparison we can make in this case, given the available systems, is to compare the consumer-grade GPUs against one another (AMD Radeon 7900 XTX and NVIDIA RTX 3080), and the data-centre GPUs (AMD MI300X and NVIDIA GH200) against one another. However, it is still far from ideal:
- The 7900 XTX is a newer GPU and at a higher price-point in the respective product line-up than the RTX 3080
- The host machine for the 7900 XTX is significantly newer than for the RTX 3080
- The Nvidia Grace-Hopper and AMD MI300X systems are of a similar age, however the HPC nodes are very different, with 1 GPU per CPU in the Grace-Hopper system compared to 4 GPUs per CPU in the AMD MI300X system.
As such, these performance comparisons are far from ideal but do show the real-world observed differences between the individual models used.
Figure 18 shows the per-GPU simulation throughput (higher is better) for the celeritas-project/regression benchmark set when executed on the available data-centre and consumer-level GPUs from NVIDIA and AMD. For geometries which can be executed using both ORANGE and VecGeom there are 2 markers per model for NVIDIA systems. The CMS2018 geometry (\(C_{R2}M\) and \(C_{R2}F_{U}\)) is not currently supported by ORANGE so does not run on AMD GPUs, and requires more than the 10GB of device memory available in the RTX 3080 for the number of tracks used in the regression suite.
The data-centre class GPUs, which have higher double-precision compute performance (typically 1:2 rather than the 1:32 or 1:64 of consumer devices) and higher device-memory bandwidth, outperform the consumer-grade GPUs, as you may expect for this FP64-heavy workload.
Between the data-centre class devices, the Nvidia GH200 achieved higher throughput than the AMD MI300X when comparing individual GPUs for most but not all geometries, with the largest advantage present for the simpler problem cases.
For the consumer devices, the newer AMD 7900XTX achieved higher throughput than the NVIDIA RTX 3080 in most cases. However once again the performance is in a similar region for most models.
Figure 19 shows the regression benchmark suite for the AMD systems (JADE@ARC and Awe), including the CPU version of celer-sim and Geant4 interfaces. The CPU simulations used 1/4 of a CPU in JADE@ARC and 1/2 of the CPU in Awe; this makes for fairer node-level benchmarks, but adds some bias in favour of the GPU. CPU performance is generally lower than GPU performance. The 1/2 of the consumer-grade CPU (12 of 24 cores) in Awe delivers higher simulation throughput than 1/4 of the CPU (16 of 64 cores) in JADE@ARC, although this is not a very fair comparison. The following GPU to CPU speedup ranges were reported by the celeritas-project/regression post-processing:
- JADE@ARC / JADE 2.5 showed GPU speedup of between 6.9× and 20×, compared to 1/4 of the CPU (1/8th of the 8 GPU node).
- The 7900XTX in Awe demonstrated speedups between 1.7× and 3×, using 1/2 of the available CPU cores (in a 2-GPU system).
Overall, these results show that Celeritas is portable between Nvidia and AMD devices, with similar-class GPUs offering comparable levels of performance. Due to variable GPU pricing, recommendations between vendors and individual GPU models will need to be made on a case-by-case basis. The Celeritas regression suite may be a useful tool when making this evaluation.
Acknowledgements
This work made use of the facilities of the N8 Centre of Excellence in Computationally Intensive Research (N8 CIR) provided and funded by the N8 research partnership and EPSRC (Grant No. EP/T022167/1). The Centre is co-ordinated by the Universities of Durham, Manchester and York.