The AMD ROCm Blogs site hosts a growing number of technical articles written by AMD teams with experience in optimising software for AMD GPUs. Many of these articles are directly relevant to scientific software developers, computational scientists, and research software engineers interested in kernel optimisation, profiling methodologies, GPU hardware architecture, and programming models on AMD platforms.
This page groups a selection of articles from the ROCm Blogs catalogue under three themes: optimisation, profiling and tooling, and GPU architecture and programming models. Whether you are porting an application to AMD hardware, learning to profile with ROCm tools, or investigating how well your application performs, this is a good place to start.
Optimisation
These articles address the practical work of making HPC codes run well on AMD GPUs: memory-bandwidth analysis, stencil tuning, communication libraries, process placement, and sparse linear algebra. Several are multi-part series that build progressively from a baseline implementation to more advanced tuning strategies.
Finite difference method -- Laplacian (4-part series)
A HIP implementation of a finite-difference Laplacian stencil, progressing through roofline analysis, loop tiling, register pressure management, and cross-architecture scaling.
- Laplacian part 1: baseline HIP kernel and roofline-style analysis
- Laplacian part 2: loop tiling and reordered read patterns
- Laplacian part 3: register pressure, launch bounds, non-temporal stores
- Laplacian part 4: scaling across hardware, cache limits, grid and subdomain strategies
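For readers who have not met the two ideas behind this series, the following is a minimal CPU-side sketch in plain Python, not the HIP code from the articles. It assumes a second-order, 7-point central-difference stencil on a uniform grid stored in row-major order; the series itself may use a different stencil order or data layout. The names `laplacian_3d` and `roofline_bound` are hypothetical and chosen only for illustration.

```python
def laplacian_3d(u, nx, ny, nz, h):
    """Apply a 7-point central-difference Laplacian to the interior of a 3D grid.

    u is a flat list of length nx*ny*nz in row-major (i, j, k) order.
    Boundary points of the result are left at zero.
    """
    def idx(i, j, k):
        return (i * ny + j) * nz + k

    inv_h2 = 1.0 / (h * h)
    out = [0.0] * (nx * ny * nz)
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            for k in range(1, nz - 1):
                # Sum of the six face neighbours of point (i, j, k).
                neighbours = (
                    u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)]
                    + u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)]
                    + u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]
                )
                out[idx(i, j, k)] = (neighbours - 6.0 * u[idx(i, j, k)]) * inv_h2
    return out


def roofline_bound(peak_flops, peak_bandwidth, arithmetic_intensity):
    """Attainable FLOP/s under the basic roofline model.

    The bound is the lower of the compute roof (peak_flops) and the
    memory roof (peak_bandwidth in bytes/s times FLOPs per byte moved).
    """
    return min(peak_flops, peak_bandwidth * arithmetic_intensity)
```

A stencil like this performs only a handful of floating-point operations per value loaded, so its roofline bound usually sits on the memory roof. This is consistent with the series' focus on read patterns, cache limits, and non-temporal stores.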
Seismic stencil codes (3-part series)
Performance optimisation of seismic stencil codes on AMD GPUs, from a baseline GPU implementation through to advanced tuning and performance results.
- Seismic stencil codes - part 1: introduction and baseline GPU implementation
- Seismic stencil codes - part 2: optimisation strategies and memory access patterns
- Seismic stencil codes - part 3: advanced tuning and performance results
Affinity, placement, and order (2-part series)
Process affinity, NUMA topology, and binding strategies for HPC workloads, including a case study on the Frontier supercomputer.
- Affinity part 1: NUMA concepts, process placement and ordering strategies, and a Frontier node case study
- Affinity part 2: topology discovery tools, verifying affinity, and binding techniques for MPI, OpenMP, and hybrid applications
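As a small illustration of what verifying affinity can look like from inside a process, the sketch below uses Python's standard `os.sched_getaffinity`, which is available on Linux but not on every platform. It is not taken from the articles, which discuss dedicated topology and binding tools. The helper name `current_cpu_affinity` is hypothetical.

```python
import os


def current_cpu_affinity(pid=0):
    """Return the sorted CPU IDs the process may run on, or None if unsupported.

    pid=0 refers to the calling process. os.sched_getaffinity only exists
    on some platforms (notably Linux), hence the hasattr guard.
    """
    if not hasattr(os, "sched_getaffinity"):
        return None
    return sorted(os.sched_getaffinity(pid))


if __name__ == "__main__":
    print("allowed CPUs:", current_cpu_affinity())
```

Calling this from each rank of a job, after the launcher has applied its binding options, gives a quick check that ranks are confined to the CPU sets you intended.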
Standalone articles
- Sparse matrix vector multiplication - part 1: SpMV implementation and performance considerations on AMD hardware
- Understanding RCCL bandwidth and xGMI performance on AMD Instinct MI300X: inter-GPU communication bandwidth and collective performance on MI300X
- Register pressure in AMD CDNA2 GPUs: understanding and managing register allocation and occupancy on the CDNA2 architecture
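For readers unfamiliar with the sparse matrix-vector product covered by the first article in this list, here is a minimal serial reference in plain Python. It assumes the compressed sparse row (CSR) format, a common storage choice that the article may or may not use, and the function name `spmv_csr` is hypothetical. It is meant as a correctness baseline, not a performance implementation.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[r]:row_ptr[r + 1] is the slice of col_idx/values holding row r.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        for p in range(row_ptr[row], row_ptr[row + 1]):
            # Indirect load of x: the access pattern depends on the sparsity.
            acc += values[p] * x[col_idx[p]]
        y[row] = acc
    return y


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 3, 0],
    #      [4, 0, 5]]
    row_ptr = [0, 2, 3, 5]
    col_idx = [0, 2, 1, 0, 2]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0]))  # [7.0, 6.0, 19.0]
```

The indirect read of `x` through `col_idx` is what makes SpMV irregular, and row lengths can vary widely, so GPU implementations have to decide how to distribute rows across threads.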
Profiling and tooling
These articles cover the AMD profiling ecosystem, portability toolchains, and compilers: the software infrastructure that supports development and performance analysis on AMD GPUs.
Performance profiling on AMD GPUs (3-part series)
A structured guide to profiling on AMD hardware, moving from conceptual foundations through basic tool usage to advanced analysis techniques.
- Part 1: Foundations: profiling concepts and introduction
- Part 2: Basic usage: first steps with ROCm profiling tools
- Part 3: Advanced usage: multi-device profiling and optimisation workflows
Standalone articles
- Introduction to profiling tools for AMD hardware: an overview of the ROCm profiling and tracing tool landscape
- Introducing ROCprofiler SDK: the latest toolkit for performance profiling on AMD GPUs
- Application portability with HIP: porting CUDA applications to HIP for AMD hardware
- Introducing AMD's next-gen Fortran compiler: LLVM Flang-based compiler with OpenMP GPU offload and HIP interop
GPU architecture and programming models
These articles cover AMD GPU architecture, from matrix cores and memory hierarchies to ISA-level detail, alongside practical programming guides for HIP, OpenMP offloading, GPU-aware MPI, and C++ parallel algorithms.
Standalone articles
- AMD matrix cores: an introduction to matrix core hardware and programming on AMD GPUs
- AMD Instinct MI200 GPU memory space overview: memory hierarchy and address spaces on MI200
- Jacobi solver with HIP and OpenMP offloading: implementing a classic iterative solver on AMD GPUs using two programming models
- C++17 parallel algorithms and HIPSTDPAR: running standard C++ parallel algorithms on AMD GPUs via HIPSTDPAR
- Reading AMD GPU ISA: a practical guide to reading and interpreting AMD GCN/CDNA instruction set output
- MI300A -- Exploring the APU advantage: programming and performance considerations for the MI300A accelerated processing unit
- GPU-aware MPI with ROCm: GPU-aware MPI programming model and direct device-buffer communication on AMD GPUs
About these blogs
The ROCm Blogs site is maintained by AMD and publishes technical content spanning high-performance computing, artificial intelligence, and GPU software development. The articles presented on this page were produced by AMD's HPC and AI Application Enablement teams. For the full catalogue, visit rocm.blogs.amd.com.