Today, we’re announcing general availability of the Amazon SageMaker HyperPod training operator, a purpose-built Kubernetes extension for resilient foundation model training on HyperPod.
Amazon SageMaker HyperPod empowers customers to accelerate AI model development across hundreds or thousands of GPUs with built-in resiliency, decreasing model training time by up to 40%. As training clusters expand, recovery from training interruptions becomes increasingly disruptive. Traditionally, when even a single training process fails, recovery requires a complete job restart across all nodes, resulting in additional downtime and increased costs. Moreover, identifying and resolving critical training issues such as stalled GPUs, low training throughput, and numerical instabilities typically requires complex custom monitoring code, further extending development timelines and delaying time to market.
With the HyperPod training operator, customers can further enhance training resilience for Kubernetes workloads. Instead of restarting the full job when a failure occurs, the HyperPod training operator performs surgical recovery, selectively restarting only the affected training resources so that jobs recover from faults faster. It also introduces customizable hanging-job monitoring, configured through simple YAML, to help overcome problematic training scenarios such as stalled training batches, non-numeric loss values, and performance degradation. Getting started is simple: create a HyperPod cluster, install the training operator add-on, optionally define custom recovery policies for hanging jobs, and launch training.
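To make the three monitored scenarios concrete, here is a minimal Python sketch of the kind of check a hanging-job monitor might run over a job's step history. It is an illustration only and not the operator's implementation or API: `StepRecord`, `detect_issues`, the issue names, and the default thresholds are all hypothetical, and the real operator is configured through YAML rather than code like this.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class StepRecord:
    """One completed training step (hypothetical record shape)."""
    timestamp: float        # seconds since job start
    loss: float
    samples_per_sec: float


def detect_issues(
    records: List[StepRecord],
    now: float,
    stall_timeout_s: float = 300.0,     # assumed threshold, not an operator default
    min_throughput_ratio: float = 0.5,  # assumed threshold, not an operator default
) -> List[str]:
    """Return the names of the problems visible in the step history."""
    issues: List[str] = []
    if not records:
        # No step has completed yet: stalled only once the timeout has elapsed.
        return ["stalled"] if now > stall_timeout_s else issues

    last = records[-1]

    # Stalled training batch: no step finished within the timeout window.
    if now - last.timestamp > stall_timeout_s:
        issues.append("stalled")

    # Non-numeric loss: NaN or infinity in the most recent step.
    if not math.isfinite(last.loss):
        issues.append("non_numeric_loss")

    # Performance degradation: latest throughput far below the best observed.
    best = max(r.samples_per_sec for r in records)
    if best > 0 and last.samples_per_sec < min_throughput_ratio * best:
        issues.append("throughput_degraded")

    return issues
```

In a setup like this, a non-empty result would be the signal to apply a recovery policy to the affected processes only, rather than restarting the whole job.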
This release is generally available in all AWS Regions where SageMaker HyperPod is currently supported.
See the documentation to learn more.
Source: Amazon Web Services