Today, Amazon SageMaker HyperPod announces the general availability of the health monitoring agent for Slurm clusters. SageMaker HyperPod helps you provision resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). The health monitoring agent performs passive, background health checks of instances to identify problems in key areas without impact on application behavior or performance, flags failures instantly, and replaces any unhealthy instances to keep your training jobs running smoothly.
The agent runs continuously on all GPU- or Trainium-based nodes in your HyperPod cluster, watching for hardware issues such as unresponsive GPUs or increasing NVLink error counts. When a fault is detected, it marks the node as unhealthy and automatically reboots it or replaces it with a healthy node, keeping your jobs running without manual intervention. The agent also coordinates failure handling with the job auto-resume functionality available on Slurm clusters. For example, jobs with auto-resume enabled continue from the last saved checkpoint once the agent has replaced the faulty nodes. This hands-free recovery, already available on HyperPod clusters orchestrated with Amazon EKS, now gives Slurm clusters the same resilient environment, helping teams train large models for weeks without disruption and reclaim time and costs that would otherwise be lost to mid-run failures. In addition, customers can now reboot their nodes with a simple command to clear intermittent issues, such as GPU driver problems that require a reset.
The health monitoring agent for Slurm is available in all AWS Regions where HyperPod is generally available. The agent is enabled automatically on all newly created Slurm clusters; to enable it on an existing cluster, upgrade to the latest HyperPod AMI by calling the UpdateClusterSoftware API. To learn more, visit the Amazon SageMaker HyperPod documentation.
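As a rough illustration of the upgrade step for an existing cluster, the sketch below calls UpdateClusterSoftware through a SageMaker client from the AWS SDK for Python (boto3). The helper name `upgrade_cluster_software` and the cluster name `my-slurm-cluster` are placeholders chosen for this example, not names from the announcement, and the sketch assumes the caller already has credentials and permission to update the cluster.

```python
def upgrade_cluster_software(sagemaker_client, cluster_name):
    """Request a HyperPod software (AMI) update for an existing cluster.

    `sagemaker_client` is expected to behave like boto3.client("sagemaker");
    `cluster_name` is the name or ARN of the cluster to update.
    Returns the ARN of the cluster the update was requested for.
    """
    # UpdateClusterSoftware is asynchronous: the call returns once the request
    # is accepted, and the update itself proceeds in the background.
    response = sagemaker_client.update_cluster_software(ClusterName=cluster_name)
    return response["ClusterArn"]


# Example usage (assumes boto3 is installed and AWS credentials are configured;
# "my-slurm-cluster" is a placeholder cluster name):
#
#   import boto3
#   client = boto3.client("sagemaker")
#   print(upgrade_cluster_software(client, "my-slurm-cluster"))
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real cluster.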
Categories: marketing:marchitecture/artificial-intelligence
Source: Amazon Web Services
