HemeLB Performance Report
A test submission of HemeLB, available on the HemeLB GitHub repository master branch.
Disclaimers
- This is not a commentary on code quality, but an indicator of the quality of the current SHAREing testing methodology as of this date, 20/03/26.
- This forms only a preliminary assessment of submission suitability and does not guarantee a full assessment. The pre-assessment will be provided to the submitter with information on how to continue to assessment or rejection.
Pre-assessment Overview
- 1: Code
- 2: Description of working environment
- 3: Building
- 4: Running
- 5: Code complexity
- 6: I/O
- 7: Assessment structure
1: Code
The main codebase for HemeLB can be cloned with
git clone https://github.com/UCL-CCS/HemePure.git
We use the master branch for analysis.
2: Description of working environment
Here we describe some details of the system we will be running on and how we are configuring the software environment.
2.1: Hardware information
We run on Hamilton, one of the HPC clusters hosted at Durham University. There are 120 standard compute nodes and 2 high-memory ones:
| Specification | Per node |
|---|---|
| Processors | 2x AMD EPYC 7702 64-Core Processor |
| Clock | 3289.415MHz |
| Sockets | 2 |
| Cores | 128 |
| RAM | Standard: 256GB (246GB available to users); high-memory: 2TB |
| Local storage | 400GB SSD |
Each socket on each compute node is divided into 4 NUMA domains that each have a dedicated memory channel. Each of these NUMA domains is further split into 4 groups, each with its own 16MB L3 cache (totalling 256MB per processor). Each of these groups contains 4 cores, each with 512KB of L2 cache and 32KB of L1 cache.
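As a quick sanity check on the topology described above, the per-node core count and per-processor L3 capacity can be recomputed from the stated hierarchy (a minimal sketch; all figures are taken from the hardware description above):

```python
# Recompute Hamilton node topology figures from the stated hierarchy.
sockets_per_node = 2
numa_per_socket = 4   # NUMA domains per socket
groups_per_numa = 4   # L3 cache groups per NUMA domain
cores_per_group = 4
l3_per_group_mb = 16  # dedicated L3 per group, in MB

cores_per_node = sockets_per_node * numa_per_socket * groups_per_numa * cores_per_group
l3_per_processor_mb = numa_per_socket * groups_per_numa * l3_per_group_mb

print(cores_per_node)       # matches the 128 cores in the table
print(l3_per_processor_mb)  # matches the 256MB per processor
```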
2.2: Dependencies
We require a C compiler, an MPI library and CMake for building:
module load gcc openmpi cmake
This gives an environment setup of:
Currently Loaded Modulefiles:
1) gcc/11.2 2) openmpi/4.1.1 3) cmake/3.18.4
We also do not have to manually set any environment variables or add any paths.
3: Building
The build scheme is split into two phases: building dependencies then building the source code, listed below.
Dependency build
- Create the `dep/build/` directory and `cd` into it
- In this directory run `ccmake -B. -H../` and press c to configure then e to exit
- Then back at the command line configure with CMake using

cmake -DCMAKE_C_COMPILER=gcc \
      -DCMAKE_CXX_COMPILER=g++ \
      -DCMAKE_CXX_FLAGS="-g" \
      -DHEMELB_COMPUTE_ARCHITECTURE=NEUTRAL \
      -DCMAKE_CXX_EXTENSIONS=OFF \
      -DHEMELB_USE_VELOCITY_WEIGHTS_FILE=ON \
      -DHEMELB_INLET_BOUNDARY=LADDIOLET \
      -DHEMELB_WALL_INLET_BOUNDARY=LADDIOLETSBB \
      -DHEMELB_OUTLET_BOUNDARY=NASHZEROTHORDERPRESSUREIOLET \
      -DHEMELB_WALL_OUTLET_BOUNDARY=NASHZEROTHORDERPRESSURESBB \
      -DHEMELB_LOG_LEVEL="Info" \
      -DHEMELB_USE_MPI_PARALLEL_IO=OFF \
      -DCMAKE_BUILD_TYPE=Release \
      ..

- Then run `make` in `dep/build/`
Source code build
- Create the `src/build/` directory and `cd` into it
- In this directory run `ccmake -B. -H../` and again press c to configure and then e to exit
- Then back at the command line configure with CMake using

cmake -DCMAKE_C_COMPILER=gcc \
      -DCMAKE_CXX_COMPILER=g++ \
      -DCMAKE_CXX_FLAGS="-g" \
      -DHEMELB_COMPUTE_ARCHITECTURE=NEUTRAL \
      -DCMAKE_CXX_EXTENSIONS=OFF \
      -DHEMELB_USE_VELOCITY_WEIGHTS_FILE=ON \
      -DHEMELB_INLET_BOUNDARY=LADDIOLET \
      -DHEMELB_WALL_INLET_BOUNDARY=LADDIOLETSBB \
      -DHEMELB_OUTLET_BOUNDARY=NASHZEROTHORDERPRESSUREIOLET \
      -DHEMELB_WALL_OUTLET_BOUNDARY=NASHZEROTHORDERPRESSURESBB \
      -DHEMELB_LOG_LEVEL="Info" \
      -DHEMELB_USE_MPI_PARALLEL_IO=OFF \
      -DCMAKE_BUILD_TYPE=Release \
      ..

- Run `make` in `src/build/`
4: Running
With the build instructions documented, we now run through the necessary information for running the code.
4.1: Data
The code requires input data which is hosted on Zenodo. Here we chose the test data in TestPipe.tar.gz, in particular the /nobackup/<username>/NVIDIA-TestPipe/input_VP.xml input file.
4.2: Runtime commands
mpirun -np xx hemepure -in input.xml -out results
where xx is the number of MPI processes and input.xml is the file at the path /nobackup/<username>/NVIDIA-TestPipe/input_VP.xml.
5: Code Complexity
Next to no scaling information is given in the documentation. Input files are provided to generate a weak scaling study, though we do not yet include that in the analysis here as the internode analysis methodology is still under development.
6: Memory, Storage and I/O
For the TestPipe examples the peak memory consumption given by the sacct command is approximately 17.6GB. The storage for the input files in TestPipe is 15MB, and the storage for the outputs is approximately 4.8MB.
The code has an initial stage of reading in data from disk, and then writes to disk at discrete intervals throughout the runtime.
7: Assessment structure
Finally, we list the elements of the code we analyse and the associated tools.
7.1: Assessment Dimensions
We have identified that out of the five performance dimensions, the ones relevant to this assessment and benchmark are:
- Core-level assessment
- Intranode assessment
- I/O assessment
7.2: Performance tools
For these three dimensions, at a high level we will use:
- Core-level: LIKWID
- Intranode: no tools needed, other than the inbuilt `time` function
- I/O: Darshan
High-level performance assessment report
As discussed above, we restrict our assessment to CPU, intranode and I/O. The code has functionality for both internode and GPU execution, though we do not consider these here as the GPU version of the code is distinct, and more work is required on the internode performance methodology for SHAREing.
CPU Analysis
For a basic high-level analysis of CPU performance, we look for the floating-point operation rate compared to the theoretical rate for the CPU.
The hardware capabilities were determined with
likwid-bench -t peakflops -W N:128*16kB:128
The software CPU compute rate was determined with
likwid-mpirun -np 128 -nperdomain N:128 -g FLOPS_DP -- ./hemepure -in /nobackup/<username>/NVIDIA-TestPipe/input_VP.xml -out results_core_128proc
| | MFLOP/s |
|---|---|
| CPU peak | 921372.41 |
| Measured | 30554.2358 |
We determine this software to have a CPU score of 0.033161657.
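The CPU score is simply the measured rate divided by the hardware peak rate. A minimal sketch of that calculation, using the MFLOP/s figures reported above:

```python
# CPU score = measured FLOP rate / theoretical peak rate (both in MFLOP/s).
peak_mflops = 921372.41       # likwid-bench peakflops result
measured_mflops = 30554.2358  # likwid-mpirun FLOPS_DP result

cpu_score = measured_mflops / peak_mflops
print(cpu_score)  # reproduces the reported score of 0.033161657
```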
IO Analysis
For a basic high-level analysis of IO performance, we look for the proportion of the runtime spent processing IO requests.
The IO time was determined with
$ module load gcc openmpi darshan
$ export DARSHAN_DIR=/apps/developers/tools/darshan/3.3.1/1/gcc-11.2-openmpi-4.1.1
$ export DARSHAN_LOGDIR=./darshan_logs_4proc
$ LD_PRELOAD=$DARSHAN_DIR/lib/libdarshan.so mpirun -np 128 ./hemepure -in /nobackup/<username>/NVIDIA-TestPipe/input_VP.xml -out results_io_128proc
We process these logs with
$ darshan-parser --perf darshan_logs_128proc/darshan_log_id.darshan
which gives a few important lines. One for POSIX
# shared files: time_by_cumul_io_only: 0.028255
one for MPI-IO
# shared files: time_by_cumul_io_only: 2.020328
and finally one for STDIO
# shared files: time_by_cumul_io_only: 0.010979
The I/O operations are therefore split across POSIX, MPI-IO and STDIO. This gives total I/O (reads, writes and metadata operations) runtimes of 0.028255s, 2.020328s and 0.010979s for POSIX, MPI-IO and STDIO respectively, averaged across all 128 MPI ranks.
For a total runtime of 235s, the IO utilisation ratio is 0.008764094, and the IO score is 0.991235906.
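The utilisation ratio follows directly from the Darshan figures: the three cumulative I/O times are summed and divided by the total runtime, and the score is one minus that ratio. A minimal sketch using the numbers reported above:

```python
# I/O score = 1 - (total time in I/O / total runtime).
posix_s, mpiio_s, stdio_s = 0.028255, 2.020328, 0.010979
runtime_s = 235.0

io_time_s = posix_s + mpiio_s + stdio_s  # total I/O time across interfaces
io_ratio = io_time_s / runtime_s         # fraction of runtime spent in I/O
io_score = 1.0 - io_ratio
print(io_time_s, io_score)  # reproduces 2.059562 and 0.991235906
```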
Intranode Analysis
For a basic high-level analysis of intranode performance, we perform a strong scaling by fixing the problem size and increasing core allocation.
For this code, we tested with core counts in powers of 2 from 4 to 128.
| Thread count | Time (s) | Parallel efficiency |
|---|---|---|
| 4 | 3310.983 | 1.000 |
| 8 | 1854.760 | 0.893 |
| 16 | 1786.564 | 0.463 |
| 32 | 1753.458 | 0.236 |
| 64 | 927.553 | 0.223 |
| 128 | 891.100 | 0.116 |
Hence, our 80% threshold is at 8 cores and our 60% threshold is also at 8 cores. As a proportion of the 128 cores available on the node this was run on, this gives scores of 0.0625 and 0.0625.
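The efficiency and threshold figures can be reproduced from the timing table: parallel efficiency relative to the 4-core baseline is eff(n) = (T4 x 4) / (Tn x n), and each threshold is the largest tested core count whose efficiency stays at or above the cut-off. A minimal sketch using the measured times:

```python
# Strong-scaling parallel efficiency relative to the 4-core baseline run.
times = {4: 3310.983, 8: 1854.760, 16: 1786.564,
         32: 1753.458, 64: 927.553, 128: 891.100}
base_cores = 4
base_work = times[base_cores] * base_cores  # core-seconds of the baseline

eff = {n: base_work / (t * n) for n, t in times.items()}

def threshold(cutoff):
    """Largest tested core count with parallel efficiency >= cutoff."""
    return max(n for n, e in eff.items() if e >= cutoff)

print({n: round(e, 3) for n, e in eff.items()})  # matches the table
print(threshold(0.8), threshold(0.6))            # both thresholds: 8 cores
print(threshold(0.8) / 128)                      # score: 0.0625
```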
Summary
The following table collates the results of all above sections. These scores are indicative only, and cannot truly be compared to one another meaningfully without taking into account domain knowledge and methodological differences between them.
| Result | Score | Metric result |
|---|---|---|
| CPU | 0.033161657 | 30554.2358 MFLOP/s |
| IO | 0.991235906 | 2.059562 s |
| Intranode (80%) | 0.0625 | 8 cores |
In summary we can see that the I/O performance is extremely good, as the benchmark spends the vast majority of its time in computation. However, the core and intranode performance are considerably lower. Note that the intranode scaling was performed relative to a 4-core run, rather than a serial run as is typical, though the single-node scaling still drops off very rapidly.
Following this high-level assessment, it is recommended that both the core and intranode performance are investigated in depth. Similarly, the GPU version of HemeLB should be analysed to study the compute rate this benchmark can extract on a GPU. Finally, an internode analysis in the near future would be very interesting, as HemeLB appears to focus on beyond-single-node scaling.