Amazon SageMaker HyperPod now provides comprehensive observability for Restricted Instance Groups

Modern Workspace Pro | 5 March 2026 | Amazon Web Services

Amazon SageMaker HyperPod now offers comprehensive observability for Restricted Instance Groups (RIG), enabling teams training foundation models with Nova Forge to gain deep visibility into their compute resources and training workloads. This new capability eliminates the manual effort of collecting and correlating metrics across the infrastructure stack, providing a unified view of GPU performance, system health, network throughput, and Kubernetes cluster state through a pre-configured Amazon Managed Grafana dashboard backed by Amazon Managed Service for Prometheus.

You can now monitor GPU utilization, NVLink bandwidth, CPU pressure, FSx for Lustre usage, and pod lifecycle from a single Grafana dashboard, with metrics collected across four exporters covering GPU performance, host-level system health, network fabric, and Kubernetes object state. In addition, curated logs are automatically made available in these dashboards, covering epoch progress, step-level training logs, pipeline errors, and Python tracebacks, so you can quickly diagnose training failures. HyperPod observability for Restricted Instance Groups is automatically enabled when you create a new cluster using RIGs, and can be enabled for existing clusters with a few clicks in the HyperPod cluster management console.
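Because the dashboards are backed by Amazon Managed Service for Prometheus, the same metrics can in principle be queried ad hoc through the standard Prometheus HTTP API. The sketch below is illustrative only and rests on assumptions the announcement does not confirm: that the GPU exporter publishes NVIDIA DCGM metric names (`DCGM_FI_DEV_GPU_UTIL` with a `Hostname` label), and that you have access to the workspace. The workspace ID and the sample response are hypothetical, and a real request would also need AWS SigV4 signing, which is omitted here.

```python
import json
from urllib.parse import urlencode

# Assumption: the GPU exporter publishes NVIDIA DCGM metrics. The announcement
# does not name the exporters or their metric names.
GPU_UTIL_QUERY = "avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)"


def build_query_url(region: str, workspace_id: str, promql: str) -> str:
    """Build the instant-query URL for a Managed Service for Prometheus workspace."""
    base = f"https://aps-workspaces.{region}.amazonaws.com/workspaces/{workspace_id}"
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"


def low_utilization_hosts(response_body: str, threshold: float = 50.0) -> list[str]:
    """Parse a Prometheus instant-query response and list hosts below the threshold."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    hosts = []
    for series in payload["data"]["result"]:
        host = series["metric"].get("Hostname", "unknown")
        _timestamp, value = series["value"]  # Prometheus returns the value as a string
        if float(value) < threshold:
            hosts.append(host)
    return sorted(hosts)


# Hand-written sample in the Prometheus instant-query response format.
# No network call is made; host names are hypothetical.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"Hostname": "node-a"}, "value": [1700000000, "97.5"]},
            {"metric": {"Hostname": "node-b"}, "value": [1700000000, "12.0"]},
        ],
    },
})

if __name__ == "__main__":
    # "ws-EXAMPLE" is a placeholder workspace ID, not a real one.
    print(build_query_url("us-east-1", "ws-EXAMPLE", GPU_UTIL_QUERY))
    print(low_utilization_hosts(SAMPLE_RESPONSE))
```

Keeping the URL construction and the response parsing separate from the HTTP call means the parsing logic can be checked against a saved response before any credentials are involved.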

Amazon SageMaker HyperPod RIG observability is available in all AWS Regions where SageMaker HyperPod RIG is supported. To learn more, visit the documentation.


Source: Amazon Web Services
