Utilisation monitoring of accelerated compute - Task 021
Fit to programme
This task has been identified by the working groups as part of the agenda behind WP 2.3.
The task number is 021.
Description
Many centre leaders, including those with experience, are unsure how best to monitor utilisation of accelerated compute resources. While schedulers have long integrated tools for monitoring CPU utilisation, the landscape of tooling for GPU and other accelerator utilisation is more fragmented, with less consensus on what constitutes best practice. As a result, many sites do not offer as much monitoring as users, RSEs, and RIEs would find useful. While tools to visualise raw device utilisation as a time history at system level aren’t uncommon, it is less clear how to turn this into actionable information, for example utilisation or efficiency for specific projects or jobs.
It is an ambition of SHAREing to produce training on best (or good enough) practice in this space. In advance of this, it is necessary to identify what the state of the art is, and form a consensus on a set of good practices that can be shared more widely.
As such, the aim of this Task is a workshop bringing together tool developers and RIEs deploying tools in shared accelerated compute environments, to present what options are currently available in this space, and what the relative advantages of each are.
This may be delivered as a hybrid (in-person and online) event, or purely online. The successful applicant should work closely with the WP 2.3 coordinator to ensure that the tooling to be discussed aligns with the features that seem to be missing.
Outcomes
- Delivery of a workshop (hybrid or online) exploring what options are currently available for utilisation monitoring of accelerated compute, and which are in use
  - Representation must include commercial software vendors, open-source projects, and HPC centres deploying solutions
- A report summarising the outcomes from the workshop, including:
  - Descriptions of each option
  - Comparison (e.g. tabular) of features of each option
  - Recommendations for circumstances where specific tools may be optimal