Amazon SageMaker HyperPod now supports checkpointless training, a new foundation model training capability that reduces the need for a checkpoint-based, job-level restart during fault recovery. Checkpointless training maintains forward training momentum despite failures, reducing recovery time from hours to minutes. This is a fundamental shift from traditional checkpoint-based recovery, where a failure requires pausing the entire training cluster, diagnosing the issue manually, and restoring from a saved checkpoint. That process can leave expensive AI accelerators idle for hours, wasting compute your organization has already paid for.
Checkpointless training changes this model by preserving the model training state across the distributed cluster, automatically swapping out faulty training nodes on the fly, and recovering from failures through peer-to-peer state transfer from healthy accelerators. By reducing checkpoint dependencies during recovery, checkpointless training can help your organization save on idle AI accelerator costs and accelerate time to market. Even at larger scales, checkpointless training on Amazon SageMaker HyperPod enables upwards of 95% training goodput on clusters with thousands of AI accelerators.
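To make the goodput figure concrete, the short Python sketch below shows how recovery time per failure feeds into training goodput. It assumes a simplified definition (productive time divided by total wall-clock time) and ignores work lost since the last checkpoint. The function name `goodput` and every number in it are hypothetical illustrations, not AWS measurements or part of any HyperPod API.

```python
def goodput(total_hours: float, failures: int, recovery_hours_per_failure: float) -> float:
    """Return the fraction of wall-clock time spent on productive training.

    Simplified model (an assumption): each failure costs a fixed recovery
    time, and no other overhead is counted.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if failures < 0 or recovery_hours_per_failure < 0:
        raise ValueError("failures and recovery time must be non-negative")
    lost_hours = failures * recovery_hours_per_failure
    if lost_hours > total_hours:
        raise ValueError("recovery time exceeds total run time")
    return (total_hours - lost_hours) / total_hours


if __name__ == "__main__":
    # Hypothetical 30-day run with 20 failures (made-up values).
    run_hours = 30 * 24
    # Recovery measured in hours per failure, as in checkpoint-based restarts.
    slow = goodput(run_hours, failures=20, recovery_hours_per_failure=2.0)
    # Recovery measured in minutes per failure (3 minutes = 0.05 hours).
    fast = goodput(run_hours, failures=20, recovery_hours_per_failure=0.05)
    print(f"hours-scale recovery:   {slow:.1%} goodput")
    print(f"minutes-scale recovery: {fast:.1%} goodput")
```

Under this model, the gap between the two cases widens as failures become more frequent, which is the usual situation on clusters with thousands of accelerators.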
Checkpointless training on SageMaker HyperPod is available in all AWS Regions where Amazon SageMaker HyperPod is currently available. You can enable checkpointless training with zero code changes using HyperPod recipes for popular publicly available models such as Llama and GPT OSS. For custom model architectures, you can integrate checkpointless training components with minimal modifications for PyTorch-based workflows, making it accessible to your teams regardless of their distributed training expertise.
To get started, visit the Amazon SageMaker HyperPod product page and see the checkpointless training GitHub page for implementation guidance.
Categories: marketing:marchitecture/artificial-intelligence
Source: Amazon Web Services
