Amazon SageMaker HyperPod now supports checkpointless training

Modern Workspace Pro 3 December 2025No CommentsAmazon Web Services

Amazon SageMaker HyperPod now supports checkpointless training, a new foundational model training capability that mitigates the need for a checkpoint-based job-level restart for fault recovery. Checkpointless training maintains forward training momentum despite failures, reducing recovery time from hours to minutes. This represents a fundamental shift from traditional checkpoint-based recovery, where failures require pausing the entire training cluster, diagnosing issues manually, and restoring from saved checkpoints, a process that can leave expensive AI accelerators idle for hours, costing your organization wasted compute.

Checkpointless training transforms this paradigm by preserving the model training state across the distributed cluster, automatically swapping out faulty training nodes on the fly and using peer-to-peer state transfer from healthy accelerators for failure recovery. By mitigating checkpoint dependencies during recovery, checkpointless training can help your organization save on idle AI accelerator costs and accelerate time. Even at larger scales, checkpointless training on Amazon SageMaker HyperPod enables upwards of 95% training goodput on cluster sizes with thousands of AI accelerators.

Checkpointless training on SageMaker HyperPod is available in all AWS Regions where Amazon SageMaker HyperPod is currently available. You can enable checkpointless training with zero code changes using HyperPod recipes for popular publicly available models such as Llama and GPT OSS. For custom model architectures, you can integrate checkpointless training components with minimal modifications for PyTorch-based workflows, making it accessible to your teams regardless of their distributed training expertise.

To get started, visit the Amazon SageMaker HyperPod product page and see the checkpointless training GitHub page for implementation guidance.

Categories: marketing:marchitecture/artificial-intelligence

Source: Amazon Web Services

Latest Posts

Pass It On

Comments

No comments yet. Why don’t you start the discussion?

Latest Posts

Comments

Leave a Reply Cancel reply