Fit to programme

This task has been identified by the working groups as part of the agenda behind WP 2.3.

The task number is 027.

Summary

The recent boom in AI is enabled by the HPC facilities on which AI models are trained and deployed. However, traditional HPC facilities use very little operational AI internally, and RTPs generally have little to no AI training.

This project should investigate and summarise to what extent AI is being used internally for cluster management around the globe, and identify areas where AI could be of use. In particular it should focus on open source self-hosted solutions which RTP staff can install, operate, manage and understand, rather than black box solutions.

It should also seek to deploy a prototype AI assistant that monitors system logs and emails to identify likely problems. This will take a simple approach and be RTP-led to ensure that such a system is both useful and maintainable. However, the assistant is not expected to take any actions itself; rather, it will offer advice and prepare scripts that should be run only after human oversight and intervention.

The limitations and risks of AI capabilities should also be documented, and guidance for safe deployment developed. Allowance for institutional differences and restrictions should also be made, and open source solutions should be favoured.

Approach

A landscape survey should be carried out into current use of AI in HPC infrastructure (i.e. used to manage and monitor the infrastructure), and outputs documented online.

The project will also identify areas of system management where AI could be of use, and suggest open source solutions for each of these.

A key component of this project should be the development of a framework for an open source AI assistant used to monitor system logs and emails. By using open source components, such as Ollama, and by integrating logging facilities in the correct way, the project will seek to deliver a workable solution able to identify faults and problems, and also describe a framework which can be used for other systems and input data.
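
As an illustration only, the kind of assistant described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the project's design: it assumes an Ollama server is running locally on its default port and serving a model named "llama3", and the keyword pattern, prompt wording and function names are all hypothetical. The sketch only returns advice as text and never executes anything.

```python
import json
import re
import urllib.request

# Assumption: a local Ollama server on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical keyword pattern; a real deployment would tune this per site.
SUSPICIOUS = re.compile(r"error|fail|timeout|out of memory|denied", re.IGNORECASE)


def filter_suspicious(lines, limit=50):
    """Keep only lines matching the keyword pattern, capped at the last `limit`."""
    hits = [line.rstrip("\n") for line in lines if SUSPICIOUS.search(line)]
    return hits[-limit:]


def build_prompt(hits):
    """Wrap the selected log lines in instructions for the model."""
    return (
        "You are assisting HPC support staff. The log lines below may indicate "
        "faults. Summarise likely problems and suggest commands for a human to "
        "review. Do not assume any command will be run automatically.\n\n"
        + "\n".join(hits)
    )


def ask_ollama(prompt, model="llama3", url=OLLAMA_URL, timeout=120):
    """Send one non-streaming request to Ollama and return the reply text."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]


def advise(lines, model="llama3"):
    """Return advice text for a batch of log lines, or None if nothing matched."""
    hits = filter_suspicious(lines)
    if not hits:
        return None  # nothing suspicious, so the model is not called
    return ask_ollama(build_prompt(hits), model=model)
```

A caller could pass an open log file to advise() and forward the returned text to staff for review. Pre-filtering with a simple pattern keeps the prompt small and means the model is only consulted when something looks wrong; the same structure could accept other inputs, such as email text, by swapping the filter.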

Results should be presented at major UK conferences, including HPC Days.

Outputs

Outputs should include an online landscape review hosted on the SHAREing website, suggestions for the applicability of AI, a prototype AI-assisted management tool, and a framework for implementing further aids.