Amazon SageMaker HyperPod now supports checkpointless training, a new foundation model training capability that reduces the need for checkpoint-based, job-level restarts for fault recovery. Checkpointless training maintains forward training momentum despite failures, reducing recovery time from hours to minutes. This is a fundamental shift from traditional checkpoint-based recovery, where a failure requires pausing the entire training cluster, diagnosing the issue manually, and restoring from a saved checkpoint. That process can leave expensive AI accelerators idle for hours, wasting compute your organization is paying for.
Checkpointless training changes this model by preserving the model training state across the distributed cluster, automatically swapping out faulty training nodes on the fly, and using peer-to-peer state transfer from healthy accelerators for failure recovery. By reducing checkpoint dependencies during recovery, checkpointless training can help your organization save on idle AI accelerator costs and shorten overall training time. Even at larger scales, checkpointless training on Amazon SageMaker HyperPod enables upwards of 95% training goodput on clusters with thousands of AI accelerators.
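To make the goodput figure concrete, here is a minimal back-of-the-envelope sketch. It assumes goodput means the fraction of wall-clock time spent on useful training, which is a common definition but not one the announcement spells out. The `goodput` helper, the run length, the failure count, and the recovery times are all hypothetical values chosen for illustration; they are not AWS measurements.

```python
def goodput(total_hours: float, failures: int, recovery_hours: float,
            lost_work_hours: float = 0.0) -> float:
    """Return the fraction of wall-clock time spent on useful training.

    Assumed model: each failure costs `recovery_hours` of idle time plus
    `lost_work_hours` of progress that must be redone (for example, the
    work done since the last saved checkpoint).
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    wasted = failures * (recovery_hours + lost_work_hours)
    # Clamp at zero so a pathological input cannot yield negative goodput.
    return max(0.0, (total_hours - wasted) / total_hours)


# Hypothetical 30-day run that hits 20 failures.
run_hours = 30 * 24

# Assumed checkpoint-based recovery: 2 hours idle plus 0.5 hours of redone work.
checkpoint_based = goodput(run_hours, failures=20,
                           recovery_hours=2.0, lost_work_hours=0.5)

# Assumed checkpointless recovery: 5 minutes idle, no redone work.
checkpointless = goodput(run_hours, failures=20, recovery_hours=5 / 60)

print(f"checkpoint-based: {checkpoint_based:.1%}")
print(f"checkpointless:   {checkpointless:.1%}")
```

Under these assumed numbers, cutting per-failure recovery from hours to minutes is what moves goodput into the high nineties; actual results depend on cluster size, failure rate, and workload.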
Checkpointless training on SageMaker HyperPod is available in all AWS Regions where Amazon SageMaker HyperPod is currently available. You can enable checkpointless training with zero code changes using HyperPod recipes for popular publicly available models such as Llama and GPT OSS. For custom model architectures, you can integrate checkpointless training components with minimal modifications for PyTorch-based workflows, making it accessible to your teams regardless of their distributed training expertise.
To get started, visit the Amazon SageMaker HyperPod product page and see the checkpointless training GitHub page for implementation guidance.
Categories: marketing:marchitecture/artificial-intelligence
Source: Amazon Web Services