Today, Amazon EKS announces support for up to 100,000 worker nodes in a single cluster, enabling you to run ultra-scale AI/ML training and inference workloads. With Amazon EC2’s latest generation of accelerated computing instance types, 100,000 worker nodes provide up to 1.6 million Trainium chips with Trn2 instances or 800,000 NVIDIA GPUs with P5 and P6 instances in a single cluster. This matters for ultra-scale AI/ML workloads that require all compute accelerators to be available within one cluster, because these workloads cannot be easily distributed across multiple clusters.
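As a minimal sketch of what provisioning accelerated capacity in an EKS cluster can look like, the snippet below uses boto3 to create a managed node group of Trn2 instances. The cluster name, node role ARN, and subnet ID are hypothetical placeholders, and a deployment approaching 100,000 nodes would spread capacity across many node groups with additional networking and scaling configuration not shown here.

```python
# Minimal sketch (not the launch's reference setup): creating an EKS managed
# node group of Trn2 instances with boto3. All names, ARNs, and subnet IDs
# below are hypothetical placeholders.
import boto3

eks = boto3.client("eks", region_name="us-east-1")

# Each trn2.48xlarge instance carries 16 Trainium2 chips, which is how
# 100,000 worker nodes add up to 1.6 million chips in a single cluster.
response = eks.create_nodegroup(
    clusterName="ultra-scale-training",                      # hypothetical cluster
    nodegroupName="trn2-workers",
    instanceTypes=["trn2.48xlarge"],
    scalingConfig={"minSize": 1, "desiredSize": 100, "maxSize": 1000},
    subnets=["subnet-0123456789abcdef0"],                    # hypothetical subnet
    nodeRole="arn:aws:iam::111122223333:role/eksNodeRole",   # hypothetical role
)
print(response["nodegroup"]["status"])
```

At ultra scale, you would repeat this across many node groups (or use a cluster autoscaling tool) rather than growing a single node group, keeping each group within its own scaling limits.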
The most advanced AI models with trillions of parameters demonstrate significantly enhanced capabilities in understanding context, reasoning, and solving complex tasks. To build and operate these increasingly powerful models, organizations require access to massive numbers of compute accelerators in a single cluster. Consolidated access to such a large pool of compute accelerators delivers crucial benefits: it allows organizations to build and deploy more powerful AI models than ever before, reduces costs by efficiently sharing compute accelerators between training and inference workloads, and enables seamless use of existing AI/ML tools and frameworks that are not designed to work across clusters.
To learn more, see the launch blog.