Today, we’re announcing the general availability of the Amazon SageMaker HyperPod training operator, a purpose-built Kubernetes extension for resilient foundation model training on HyperPod.
Amazon SageMaker HyperPod helps customers accelerate AI model development across hundreds or thousands of GPUs with built-in resiliency, decreasing model training time by up to 40%. As training clusters grow, recovering from training interruptions becomes increasingly disruptive. Traditionally, when even a single training process fails, recovery requires a complete job restart across all nodes, resulting in additional downtime and increased costs. Moreover, identifying and resolving critical training issues such as stalled GPUs, low training throughput, and numerical instabilities typically requires complex custom monitoring code, further extending development timelines and delaying time to market.
With the HyperPod training operator, customers can further enhance training resilience for Kubernetes workloads. Instead of restarting the full job when a failure occurs, the operator performs surgical recovery, selectively restarting only the affected training resources so that faults are resolved faster. It also introduces customizable hanging-job monitoring, defined through simple YAML configurations, to help overcome problematic training scenarios including stalled training batches, non-numeric loss values, and performance degradation. Getting started is simple: create a HyperPod cluster, install the training operator add-on, optionally define custom recovery policies for hanging jobs, and launch training.
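To make the idea of selective recovery more concrete, the following is a minimal Python sketch of the kind of check a hanging-job monitor might perform. It is not the operator's implementation or API: the real operator is configured through YAML whose schema is not shown in this announcement, and every name here (`RankStatus`, `diagnose`, `ranks_to_restart`, and the threshold parameters and their defaults) is hypothetical. The sketch only illustrates the three scenarios named above (stalled batches, non-numeric loss, degraded throughput) and how a monitor could flag individual processes rather than the whole job.

```python
import math
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class RankStatus:
    """Latest health snapshot from one training process (hypothetical schema)."""
    rank: int
    last_step_time: float    # wall-clock seconds when the last batch completed
    loss: float              # most recent loss value
    samples_per_sec: float   # most recent throughput measurement


def diagnose(
    status: RankStatus,
    now: float,
    stall_timeout_s: float = 300.0,      # assumed threshold, not an operator default
    min_samples_per_sec: float = 100.0,  # assumed threshold, not an operator default
) -> Optional[str]:
    """Return a fault label for one process, or None if it looks healthy."""
    if now - status.last_step_time > stall_timeout_s:
        return "stalled_batch"
    if not math.isfinite(status.loss):   # catches NaN and +/- infinity
        return "non_numeric_loss"
    if status.samples_per_sec < min_samples_per_sec:
        return "low_throughput"
    return None


def ranks_to_restart(
    statuses: Iterable[RankStatus], now: float, **thresholds: float
) -> Dict[int, str]:
    """Map each faulty rank to its fault label; healthy ranks are left out."""
    faults: Dict[int, str] = {}
    for status in statuses:
        reason = diagnose(status, now, **thresholds)
        if reason is not None:
            faults[status.rank] = reason
    return faults


if __name__ == "__main__":
    now = 1000.0
    statuses = [
        RankStatus(0, 990.0, 2.31, 450.0),          # healthy
        RankStatus(1, 500.0, 2.30, 440.0),          # no batch for 500 s
        RankStatus(2, 995.0, float("nan"), 460.0),  # loss became NaN
        RankStatus(3, 998.0, 2.29, 20.0),           # throughput collapsed
    ]
    print(ranks_to_restart(statuses, now))
```

In this sketch only ranks 1, 2, and 3 would be flagged, while rank 0 keeps running. That is the contrast with a full job restart that the announcement describes; how the operator actually selects and restarts resources is covered in its documentation.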
This release is generally available in all AWS Regions where SageMaker HyperPod is currently supported.
See the documentation to learn more.
Source: Amazon Web Services