ROCm Blogs: Guides for HPC & AI Practitioners

A collection of technical articles covering profiling and optimisation on AMD GPUs
Thomas Gibson April 2026

The AMD ROCm Blogs site hosts a growing number of technical articles written by teams at AMD with experience in optimising software for AMD GPUs. Many of these articles are directly relevant to scientific software developers, computational scientists, and research software engineers interested in kernel optimisation, profiling methodologies, GPU hardware architecture, and programming models on AMD platforms.

This page brings together several articles from across the ROCm Blogs catalogue. Whether you are porting an application to AMD hardware, learning to profile with ROCm tools, or investigating how well your application performs, this is a good place to start.

Optimisation

These articles address the practical work of making HPC codes run well on AMD GPUs: memory-bandwidth analysis, stencil tuning, communication libraries, process placement, and sparse linear algebra. Several are multi-part series that build progressively from a baseline implementation to more advanced tuning strategies.

Finite difference method -- Laplacian (4-part series)

A HIP implementation of a finite-difference Laplacian stencil, progressing through roofline analysis, loop tiling, register pressure management, and cross-architecture scaling.

  1. Laplacian part 1: baseline HIP kernel and roofline-style analysis
  2. Laplacian part 2: loop tiling and reordered read patterns
  3. Laplacian part 3: register pressure, launch bounds, non-temporal stores
  4. Laplacian part 4: scaling across hardware, cache limits, grid and subdomain strategies
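For readers unfamiliar with the kind of kernel this series optimises, the following is a minimal CPU-side sketch in Python. It is not the HIP code from the articles. It assumes a second-order, 7-point central-difference stencil on a uniform grid stored as a flat array with x as the fastest-varying index; the names `laplacian_7pt` and `make_quadratic` are hypothetical and chosen only for illustration.

```python
def laplacian_7pt(u, nx, ny, nz, hx, hy, hz):
    """Apply a second-order 7-point Laplacian to the interior of a flat 3D grid.

    u is a flat list of length nx*ny*nz with x as the fastest-varying index
    (an assumed layout). Boundary points of the result are left at zero.
    """
    inv_hx2 = 1.0 / hx**2
    inv_hy2 = 1.0 / hy**2
    inv_hz2 = 1.0 / hz**2
    out = [0.0] * (nx * ny * nz)
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                c = i + nx * (j + ny * k)  # flat index of point (i, j, k)
                out[c] = (
                    (u[c - 1] - 2.0 * u[c] + u[c + 1]) * inv_hx2
                    + (u[c - nx] - 2.0 * u[c] + u[c + nx]) * inv_hy2
                    + (u[c - nx * ny] - 2.0 * u[c] + u[c + nx * ny]) * inv_hz2
                )
    return out


def make_quadratic(nx, ny, nz, h):
    """Sample u = x^2 + y^2 + z^2, whose exact Laplacian is 6 everywhere."""
    return [
        (i * h) ** 2 + (j * h) ** 2 + (k * h) ** 2
        for k in range(nz)
        for j in range(ny)
        for i in range(nx)
    ]
```

Because central differences are exact for quadratics, applying `laplacian_7pt` to the output of `make_quadratic` should give 6 at every interior point up to rounding, which makes a convenient correctness check before any tuning.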

Seismic stencil codes (3-part series)

Performance optimisation of seismic stencil codes on AMD GPUs, from a baseline GPU implementation through to advanced tuning and performance results.

  1. Seismic stencil codes - part 1: introduction and baseline GPU implementation
  2. Seismic stencil codes - part 2: optimisation strategies and memory access patterns
  3. Seismic stencil codes - part 3: advanced tuning and performance results

Affinity, placement, and order (2-part series)

Process affinity, NUMA topology, and binding strategies for HPC workloads, including a case study on the Frontier supercomputer.

  1. Affinity part 1: NUMA concepts, process placement and ordering strategies, and a Frontier node case study
  2. Affinity part 2: topology discovery tools, verifying affinity, and binding techniques for MPI, OpenMP, and hybrid applications
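As a rough illustration of what verifying affinity can look like from inside a process, here is a minimal sketch in Python. It is not the tooling described in the articles. It assumes a Linux system, where `os.sched_getaffinity` is available; the helper names `allowed_cpus` and `format_cpu_ranges` are hypothetical.

```python
import os


def allowed_cpus(pid=0):
    """Return the sorted CPU ids the process may run on (pid 0 means the caller)."""
    if not hasattr(os, "sched_getaffinity"):
        # Assumption: Linux. The call is missing on some other platforms.
        raise NotImplementedError("os.sched_getaffinity is not available here")
    return sorted(os.sched_getaffinity(pid))


def format_cpu_ranges(cpus):
    """Compress a sorted list of CPU ids into a range string such as '0-3,8'."""
    ranges = []
    start = prev = None
    for cpu in cpus:
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu  # extend the current contiguous run
        else:
            ranges.append((start, prev))
            start = prev = cpu
    if start is not None:
        ranges.append((start, prev))
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


if __name__ == "__main__":
    print(f"pid {os.getpid()} may run on CPUs {format_cpu_ranges(allowed_cpus())}")
```

Run once per process under a job launcher, each process prints its own line, so overlapping or unexpectedly wide CPU sets are easy to spot.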

Standalone articles

Profiling and tooling

These articles cover the AMD profiling ecosystem, portability toolchains, and compilers: the software infrastructure that supports development and performance analysis on AMD GPUs.

Performance profiling on AMD GPUs (3-part series)

A structured guide to profiling on AMD hardware, moving from conceptual foundations through basic tool usage to advanced analysis techniques.

  1. Part 1: Foundations: profiling concepts and introduction
  2. Part 2: Basic usage: first steps with ROCm profiling tools
  3. Part 3: Advanced usage: multi-device profiling and optimisation workflows

Standalone articles

GPU architecture and programming models

These articles cover AMD GPU architecture from matrix cores and memory hierarchies to ISA-level detail, alongside practical programming guides for HIP, OpenMP offloading, and C++ parallel algorithms.

Standalone articles

About these blogs

The ROCm Blogs site is maintained by AMD and publishes technical content spanning high-performance computing, artificial intelligence, and GPU software development. The articles presented on this page were produced by AMD's HPC and AI Application Enablement teams. For the full catalogue, visit rocm.blogs.amd.com.