Amazon SageMaker HyperPod now offers comprehensive observability for Restricted Instance Groups (RIG), enabling teams training foundation models with Nova Forge to gain deep visibility into their compute resources and training workloads. This new capability eliminates the manual effort of collecting and correlating metrics across the infrastructure stack, providing a unified view of GPU performance, system health, network throughput, and Kubernetes cluster state through a pre-configured Amazon Managed Grafana dashboard backed by Amazon Managed Service for Prometheus.
You can now monitor GPU utilization, NVLink bandwidth, CPU pressure, FSx for Lustre usage, and pod lifecycle from a single Grafana dashboard. Metrics are collected by four exporters covering GPU performance, host-level system health, network fabric, and Kubernetes object state. Curated logs are also made available automatically in the same dashboard, covering epoch progress, step-level training logs, pipeline errors, and Python tracebacks, so you can quickly diagnose training failures. HyperPod observability for Restricted Instance Groups is enabled automatically when you create a new cluster using RIGs, and can be enabled for existing clusters with a few clicks in the HyperPod cluster management console.
Amazon SageMaker HyperPod RIG observability is available in all AWS Regions where SageMaker HyperPod RIG is supported. To learn more, visit the documentation.
Source: Amazon Web Services