ROCm Blogs: Guides for HPC & AI Practitioners

A collection of technical articles covering profiling and optimisation on AMD GPUs

Thomas Gibson, April 2026

The AMD ROCm Blogs site hosts a growing collection of technical articles written by AMD teams experienced in optimising software for AMD GPUs. Many of these articles are directly relevant to scientific software developers, computational scientists, and research software engineers interested in kernel optimisation, profiling methodologies, GPU hardware architecture, and programming models on AMD platforms.

This page brings together a selection of articles from across the ROCm Blogs catalogue. Whether you are porting an application to AMD hardware, learning to profile with ROCm tools, or exploring how much performance your application can achieve, this is a good place to start.

Optimisation

These articles address the practical work of making HPC codes run well on AMD GPUs: memory-bandwidth analysis, stencil tuning, communication libraries, process placement, and sparse linear algebra. Several are multi-part series that build progressively from a baseline implementation to more advanced tuning strategies.

Finite difference method -- Laplacian (4-part series)

A HIP implementation of a finite-difference Laplacian stencil, progressing through roofline analysis, loop tiling, register pressure management, and cross-architecture scaling.

Laplacian part 1: baseline HIP kernel and roofline-style analysis
Laplacian part 2: loop tiling and reordered read patterns
Laplacian part 3: register pressure, launch bounds, non-temporal stores
Laplacian part 4: scaling across hardware, cache limits, grid and subdomain strategies

Seismic stencil codes (3-part series)

Performance optimisation of seismic stencil codes on AMD GPUs, from a baseline GPU implementation through to advanced tuning and performance results.

Seismic stencil codes - part 1: introduction and baseline GPU implementation
Seismic stencil codes - part 2: optimisation strategies and memory access patterns
Seismic stencil codes - part 3: advanced tuning and performance results

Affinity, placement, and order (2-part series)

Process affinity, NUMA topology, and binding strategies for HPC workloads, including a case study on the Frontier supercomputer.

Affinity part 1: NUMA concepts, process placement and ordering strategies, and a Frontier node case study
Affinity part 2: topology discovery tools, verifying affinity, and binding techniques for MPI, OpenMP, and hybrid applications

Standalone articles

Sparse matrix vector multiplication - part 1: SpMV implementation and performance considerations on AMD hardware
Understanding RCCL bandwidth and xGMI performance on AMD Instinct MI300X: inter-GPU communication bandwidth and collective performance on MI300X
Register pressure in AMD CDNA2 GPUs: understanding and managing register allocation and occupancy on the CDNA2 architecture

Profiling and tooling

These articles cover the AMD profiling ecosystem, portability toolchains, and compilers: the software infrastructure that supports development and performance analysis on AMD GPUs.

Performance profiling on AMD GPUs (3-part series)

A structured guide to profiling on AMD hardware, moving from conceptual foundations through basic tool usage to advanced analysis techniques.

Part 1: Foundations: profiling concepts and introduction
Part 2: Basic usage: first steps with ROCm profiling tools
Part 3: Advanced usage: multi-device profiling and optimisation workflows

Standalone articles

Introduction to profiling tools for AMD hardware: an overview of the ROCm profiling and tracing tool landscape
Introducing ROCprofiler SDK: the latest toolkit for performance profiling on AMD GPUs
Application portability with HIP: porting CUDA applications to HIP for AMD hardware
Introducing AMD's next-gen Fortran compiler: LLVM Flang-based compiler with OpenMP GPU offload and HIP interop

GPU architecture and programming models

These articles cover AMD GPU architecture, from matrix cores and memory hierarchies to ISA-level detail, alongside practical programming guides for HIP, OpenMP offloading, and C++ parallel algorithms.

Standalone articles

AMD matrix cores: an introduction to matrix core hardware and programming on AMD GPUs
AMD Instinct MI200 GPU memory space overview: memory hierarchy and address spaces on MI200
Jacobi solver with HIP and OpenMP offloading: implementing a classic iterative solver on AMD GPUs using two programming models
C++17 parallel algorithms and HIPSTDPAR: running standard C++ parallel algorithms on AMD GPUs via HIPSTDPAR
Reading AMD GPU ISA: a practical guide to reading and interpreting AMD GCN/CDNA instruction set output
MI300A -- Exploring the APU advantage: programming and performance considerations for the MI300A accelerated processing unit
GPU-aware MPI with ROCm: GPU-aware MPI programming model and direct device-buffer communication on AMD GPUs

About these blogs

The ROCm Blogs site is maintained by AMD and publishes technical content spanning high-performance computing, artificial intelligence, and GPU software development. The articles presented on this page were produced by AMD's HPC and AI Application Enablement teams. For the full catalogue, visit rocm.blogs.amd.com.