Today, at its re:Invent conference, Amazon's AWS division announced the general availability of SageMaker HyperPod, a new purpose-built service for training and fine-tuning large language models (LLMs). Amazon has placed a big bet on SageMaker, its platform for building, training, and deploying machine learning models, positioning it as the backbone of its machine learning strategy. With the rise of generative AI, it's no surprise that the company is leaning on SageMaker as the key tool for training and fine-tuning LLMs.
Ankur Mehrotra, AWS' general manager for SageMaker, explained the benefits of SageMaker HyperPod in an interview. He emphasized its ability to create a distributed cluster of accelerated instances that is optimized for distributed training, efficiently spreading models and data across the cluster and significantly speeding up training. SageMaker HyperPod also lets users save checkpoints frequently, so they can pause and optimize a training run without starting over. That feature, along with the service's fail-safes for handling GPU failures, makes the training process more resilient. Mehrotra said these capabilities can make foundation-model training up to 40% faster, a significant differentiator in terms of both cost and time to market.
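To make the checkpoint-and-resume pattern concrete, here is a minimal PyTorch sketch of what a training script might do; the checkpoint path and helper names are illustrative assumptions, not HyperPod-specific APIs.

```python
import os
import torch

# Hypothetical checkpoint location; HyperPod itself does not mandate a path.
CKPT_PATH = "/opt/ml/checkpoints/latest.pt"

def save_checkpoint(model: torch.nn.Module,
                    optimizer: torch.optim.Optimizer,
                    step: int) -> None:
    # Persist enough state to resume the run exactly where it stopped.
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "step": step},
        CKPT_PATH,
    )

def load_checkpoint(model: torch.nn.Module,
                    optimizer: torch.optim.Optimizer) -> int:
    # If a checkpoint exists (e.g. after a faulty node was replaced),
    # restore it and return the step to resume from; otherwise start at 0.
    if os.path.exists(CKPT_PATH):
        state = torch.load(CKPT_PATH)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]
    return 0
```

The point of returning the step is that a restarted job, whether paused deliberately or rescheduled after a GPU failure, can pick up the training loop exactly where it left off rather than starting from scratch.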
Users of SageMaker HyperPod can choose to train on Amazon's custom Trainium chips, including the new Trainium 2, or opt for Nvidia-based GPU instances, such as those built on the H100 processor, giving teams flexibility in how they provision training hardware.
AWS's experience with training LLMs on SageMaker is evident in the Falcon 180B model, which was trained on SageMaker using a cluster of thousands of A100 GPUs. That experience, combined with AWS's history of scaling SageMaker itself, was instrumental in developing HyperPod.
Perplexity AI, an early user of the service, was initially skeptical that AWS's infrastructure could handle large-model training, but that skepticism was quickly dispelled once the company tested the service. Aravind Srinivas, co-founder and CEO of Perplexity AI, noted how easy it was to get support from AWS and that GPU availability was sufficient. Srinivas also pointed to AWS's work on speeding up the interconnects that link Nvidia's graphics cards, a crucial factor in training large models efficiently.
Beyond the headline numbers, SageMaker HyperPod reduces the time it takes to train foundation models by providing purpose-built infrastructure for distributed training at scale. It supports training runs over extended periods: SageMaker actively monitors cluster health, replaces faulty nodes, and resumes training from the latest checkpoint. The clusters come preconfigured with SageMaker's distributed training libraries, which split training data and models across nodes so they can be processed in parallel.
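As a rough illustration of what such distributed training libraries do, here is a generic data-parallel sketch in PyTorch. It uses the stock torch.distributed/NCCL stack rather than SageMaker's own libraries, and the toy model and dataset exist only to keep the example self-contained.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Launched via torchrun, which sets RANK, WORLD_SIZE, and the rendezvous
# variables that init_process_group reads from the environment.
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

# Toy model and dataset, only to keep the sketch self-contained.
model = DDP(torch.nn.Linear(128, 1).cuda(), device_ids=[local_rank])
dataset = TensorDataset(torch.randn(1024, 128), torch.randn(1024, 1))

# DistributedSampler hands each process a distinct shard of the data,
# which is the essence of data-parallel training across a cluster.
loader = DataLoader(dataset, batch_size=32,
                    sampler=DistributedSampler(dataset))

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
for features, target in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(
        model(features.cuda()), target.cuda())
    loss.backward()  # DDP averages gradients across all processes here
    optimizer.step()

dist.destroy_process_group()
```

Run across several nodes, each process trains on its own shard of the data while gradients are averaged behind the scenes, which is the basic mechanism that lets a cluster parallelize a single training job.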
Users can get started with SageMaker HyperPod by creating and managing clusters through the AWS Management Console or the AWS Command Line Interface (CLI). They configure instance groups with the desired instance types and set the number of instances allocated to each group. Users also need to prepare and upload lifecycle scripts to an Amazon S3 bucket; these run on the instances in each group during cluster creation and allow further customization of the cluster environment.
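For those who prefer the SDK to the console, cluster creation might look roughly like the following boto3 sketch. Every name, count, and ARN here is a placeholder, and the exact request shape should be verified against the current CreateCluster documentation.

```python
import boto3

sm = boto3.client("sagemaker")

# All names, counts, and ARNs below are placeholders.
response = sm.create_cluster(
    ClusterName="llm-training-cluster",
    InstanceGroups=[
        {
            "InstanceGroupName": "worker-group",
            "InstanceType": "ml.p5.48xlarge",  # H100-based instances
            "InstanceCount": 4,
            "LifeCycleConfig": {
                # Lifecycle scripts staged in S3 run on each instance at
                # cluster creation to customize the environment.
                "SourceS3Uri": "s3://my-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/SageMakerClusterRole",
        }
    ],
)
print(response["ClusterArn"])
```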
For workload orchestration, SageMaker HyperPod supports Slurm, an open source cluster management and job scheduling system, which users can install and configure through lifecycle scripts as part of cluster creation. SageMaker HyperPod also supports training in various environments, including Conda, venv, Docker, and enroot. To accelerate training, users can prepare models, tokenize datasets, and pre-compile models using ahead-of-time (AOT) compilation.
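As a sketch of how a job might be submitted through Slurm, here is a small Python driver that writes an sbatch script and queues it; the directives, node counts, and the train.py entry point are illustrative assumptions rather than HyperPod defaults.

```python
import subprocess
import textwrap

# Hypothetical sbatch script: the node counts, torchrun invocation, and
# train.py are illustrative, not HyperPod defaults.
sbatch_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=llm-pretrain
    #SBATCH --nodes=4
    #SBATCH --ntasks-per-node=1
    srun torchrun --nnodes=4 --nproc_per_node=8 train.py
    """)

with open("train.sbatch", "w") as f:
    f.write(sbatch_script)

# sbatch queues the job; Slurm schedules it across the cluster's nodes.
subprocess.run(["sbatch", "train.sbatch"], check=True)
```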
Finally, users launch jobs on the cluster with the sbatch command, and SageMaker HyperPod's resiliency features automatically detect hardware failures, replace faulty nodes, and resume training from checkpoints. For monitoring and profiling model training jobs, SageMaker-hosted TensorBoard or other tools can be used. SageMaker HyperPod is available in several AWS regions across the US, Asia Pacific, and Europe.
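On the monitoring side, a training script can emit metrics in TensorBoard's event format so that a SageMaker-hosted (or local) TensorBoard can visualize the run. A minimal sketch using PyTorch's SummaryWriter, with an assumed log directory:

```python
from torch.utils.tensorboard import SummaryWriter

# The log directory is an assumption; point a SageMaker-hosted or local
# TensorBoard at it to visualize the run.
writer = SummaryWriter(log_dir="/opt/ml/output/tensorboard")

for step in range(100):
    fake_loss = 1.0 / (step + 1)  # placeholder metric for the sketch
    writer.add_scalar("train/loss", fake_loss, step)

writer.close()
```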
In conclusion, Amazon SageMaker HyperPod represents a significant step forward in the training and fine-tuning of large language models, offering enhanced speed, flexibility, and resiliency. Its introduction is a testament to Amazon's commitment to advancing the field of machine learning and generative AI.