Amazon SageMaker HyperPod now offers comprehensive observability for Restricted Instance Groups (RIGs), giving teams that train foundation models with Nova Forge deep visibility into their compute resources and training workloads. This new capability eliminates the manual effort of collecting and correlating metrics across the infrastructure stack. It provides a unified view of GPU performance, system health, network throughput, and Kubernetes cluster state through a pre-configured Amazon Managed Grafana dashboard backed by Amazon Managed Service for Prometheus.
You can now monitor GPU utilization, NVLink bandwidth, CPU pressure, FSx for Lustre usage, and pod lifecycle from a single Grafana dashboard. Metrics are collected by four exporters covering GPU performance, host-level system health, network fabric, and Kubernetes object state. In addition, curated logs are automatically made available in these dashboards, covering epoch progress, step-level training logs, pipeline errors, and Python tracebacks, so you can quickly diagnose training failures. HyperPod observability for Restricted Instance Groups is automatically enabled when you create a new cluster using RIGs, and can be enabled for existing clusters in a few clicks in the HyperPod cluster management console.
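For teams that want to read the same metrics outside the dashboard, the following is a minimal sketch of building a PromQL instant query against an Amazon Managed Service for Prometheus workspace. It rests on assumptions not stated in this announcement: that GPU utilization is exposed under the NVIDIA DCGM exporter's standard metric name `DCGM_FI_DEV_GPU_UTIL`, that series carry the usual Prometheus `instance` label, and that the region and workspace ID shown are placeholders. The helper name `build_gpu_util_query` is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the DCGM exporter's standard GPU utilization gauge. The
# announcement does not name the metrics that RIG observability exposes.
GPU_UTIL_METRIC = "DCGM_FI_DEV_GPU_UTIL"


def build_gpu_util_query(region: str, workspace_id: str, window: str = "5m") -> str:
    """Return an instant-query URL for average GPU utilization per instance."""
    # PromQL: average the gauge over the window, then average across the
    # GPUs that share the same `instance` label (assumed to identify a node).
    promql = f"avg by (instance) (avg_over_time({GPU_UTIL_METRIC}[{window}]))"
    base = (
        f"https://aps-workspaces.{region}.amazonaws.com"
        f"/workspaces/{workspace_id}/api/v1/query"
    )
    # urlencode percent-encodes the brackets and parentheses in the expression.
    return f"{base}?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Hypothetical region and workspace ID, for illustration only.
    print(build_gpu_util_query("us-east-1", "ws-EXAMPLE"))
```

The sketch only constructs the URL. Requests to the workspace query endpoint must be signed with AWS Signature Version 4, so in practice the URL would be passed to a SigV4-aware HTTP client rather than fetched directly.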
Amazon SageMaker HyperPod RIG observability is available in all AWS Regions where SageMaker HyperPod RIG is supported. To learn more, visit the documentation.
Source: Amazon Web Services