Training a frontier model is extremely compute-intensive, requiring a distributed system of hundreds, or thousands, of accelerated instances running for several weeks or months to complete a single job. For example, pre-training the Llama 3 70B model with 15 trillion training tokens took 6.5 million H100 GPU hours. On 256 Amazon EC2 P5 instances (p5.48xlarge, each with 8 NVIDIA H100 GPUs), this would take approximately 132 days.
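The 132-day figure follows from straightforward arithmetic over the cluster's aggregate GPU count. The minimal sketch below reproduces it, using only the numbers quoted above and ignoring any downtime (which is the subject of the rest of this post):

```python
# Back-of-the-envelope estimate of ideal (failure-free) training time.
gpu_hours_required = 6_500_000     # Llama 3 70B pre-training on 15T tokens (figure quoted above)
instances = 256                    # p5.48xlarge instances
gpus_per_instance = 8              # NVIDIA H100 GPUs per instance

total_gpus = instances * gpus_per_instance           # 2,048 GPUs
wall_clock_hours = gpu_hours_required / total_gpus   # ~3,174 hours
print(f"{wall_clock_hours / 24:.0f} days")           # ~132 days
```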
Distributed training workloads run in a synchronous manner because each training step requires all participating instances to complete their calculations before the model can advance to the next step. This means that if a single instance fails, the entire job stops. As cluster sizes grow, the likelihood of failure increases because of the number of hardware components involved. Each hardware failure can result in wasted GPU hours and requires valuable engineering time to identify and resolve the issue, making the system prone to downtime that can disrupt progress and delay completion. To assess system reliability, engineering teams often rely on key metrics such as mean time between failures (MTBF), which measures the average operational time between hardware failures and serves as a valuable indicator of system robustness.
In this post, we explore the challenges of large-scale frontier model training, focusing on hardware failures and the benefits of Amazon SageMaker HyperPod, a resilient solution that minimizes disruptions, enhances efficiency, and reduces training costs.
Instance failure rate
To understand the typical MTBF for large-scale frontier model training, it helps to first look at instance failure rates by reviewing three noteworthy examples:
- When training OPT-175B on 992 A100 GPUs, Meta AI encountered significant hardware reliability challenges. Over 2 months, the team managed 35 manual restarts and cycled over 100 hosts due to hardware issues, and automated systems triggered more than 70 restarts. Running 124 instances (each with 8 GPUs) continuously over 1,440 hours, Meta accumulated a total of 178,560 instance-hours. The observed failure rate during this period was around 0.0588% per instance-hour, underscoring the reliability hurdles of training large frontier models at this scale.
- During the training of Llama 3.1 405B on 16,000 H100 GPUs, a total of 417 unscheduled hardware failures occurred over a 54-day period. This translates to an effective failure rate of about 0.0161% per instance-hour.
- MPT-7B was trained on 1 trillion tokens over the course of 9.5 days on 440 A100-40GB GPUs. During this period, the training job experienced 4 hardware failures, resulting in an effective failure rate of approximately 0.0319% per instance-hour.
Based on these examples, it's realistic to expect that in any single hour of large-scale distributed training, an instance will fail about 0.02%–0.06% of the time.
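As a sanity check, these per-instance-hour rates can be recomputed directly from the reported incident counts. The sketch below does so for the three examples; the OPT-175B failure count assumes the roughly 35 manual and 70 automated restarts each correspond to a distinct failure, which is how the 0.0588% figure is obtained:

```python
# failure rate = failures / (instances * hours), expressed as a percentage per instance-hour
def failure_rate_pct(failures, instances, hours):
    return 100 * failures / (instances * hours)

print(failure_rate_pct(105, 124, 1_440))      # OPT-175B: ~0.0588%
print(failure_rate_pct(417, 2_000, 54 * 24))  # Llama 3.1 405B (16,000 GPUs = 2,000 instances): ~0.0161%
print(failure_rate_pct(4, 55, 9.5 * 24))      # MPT-7B (440 GPUs = 55 instances): ~0.0319%
```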
Larger clusters, more failures, smaller MTBF
As cluster size increases, the entropy of the system increases, resulting in a lower MTBF. The following table shows how the MTBF (in hours) changes with the number of instances in a cluster for a given per-instance failure rate. For example, with a 0.04% per-hour failure rate per instance, a 512-instance cluster is expected to experience a failure approximately every 5 hours.
| Failure rate (per instance per hour) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| 0.01% | 2500 | 1250 | 625 | 313 | 157 | 79 | 40 | 20 |
| 0.02% | 1250 | 625 | 313 | 157 | 79 | 40 | 20 | 10 |
| 0.04% | 625 | 313 | 157 | 79 | 40 | 20 | 10 | 5 |
| 0.08% | 313 | 157 | 79 | 40 | 20 | 10 | 5 | 3 |

Table 1: MTBF (in hours) as a function of cluster size in instances (columns) and assumed per-instance failure rate (rows)
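Table 1 assumes failures are independent across instances, so the cluster-level MTBF is approximately the inverse of the aggregate failure rate. A minimal sketch that reproduces the table (up to rounding):

```python
# MTBF (hours) ≈ 1 / (cluster_size * per-instance failure rate per hour)
def mtbf_hours(cluster_size, failure_rate_pct):
    return 1 / (cluster_size * failure_rate_pct / 100)

for rate in (0.01, 0.02, 0.04, 0.08):
    row = [round(mtbf_hours(n, rate)) for n in (4, 8, 16, 32, 64, 128, 256, 512)]
    print(f"{rate:.2f}% -> {row}")
# e.g. 0.04% per instance-hour on 512 instances -> a failure roughly every 5 hours
```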
What happens after a failure?
In a perfect world without failures, the training job proceeds as shown in the following graph, which illustrates total training time without failures, progressing linearly.

Figure 1: Training is linear in a perfect world without failures, since there are no interruptions to completion.
However, as previously noted, hardware failures are inevitable. Troubleshooting these failures typically involves several steps:
- Root cause analysis (mean time to detect) – Identifying hardware failures as the root cause of training interruptions can be time-consuming, especially in complex systems with multiple potential failure points. The time taken to determine the root cause is referred to as mean time to detect (MTTD).
- Hardware repair or replacement (mean time to replace) – Sometimes a simple instance restart resolves the issue. At other times, the instance must be replaced, which can involve logistical delays, especially if specialized components aren't readily available. If a replacement instance isn't available when a GPU fails, the system must wait for one to become available. Common distribution strategies, such as PyTorch FSDP, don't support redistributing the workload among the remaining instances.
- System recovery and resumption (mean time to restart) – After resolving hardware issues and replacing the instance, additional time is needed to restore it to its previous state. The new instance must match the original configuration, and the entire cluster must load the model weights from the latest saved checkpoint.
Each failure incurs engineering effort to identify its root cause. When hardware issues arise, diagnostics confirm the problem and isolate the faulty instance, pausing the training job and increasing downtime. The impact of these failures is illustrated in the following figure and can be empirically measured for large distributed training jobs. The figure outlines the troubleshooting steps that follow a failure.

Figure 2: Impact of failures on a distributed training run. Once a failure occurs, time (idle GPUs) is spent on detecting (MTTD), replacing (MTTReplace), and restarting (MTTRestart) a training run, often wasting time and expensive resources.
In a scenario where a distributed training job is running on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with n reserved instances and an Auto Scaling group set to maintain a minimum of n instances, a hardware issue such as a GPU failure can cause the job to fail. The affected instance will be marked as Unhealthy by a Kubernetes health monitor such as Node Problem Detector, and Amazon EKS will attempt to reschedule the training pods to healthy instances. If no instances have sufficient resources, the pods remain in a Pending state, and because the instance count is limited to n, no new instance will be automatically provisioned.
In such cases, the failed job must be manually identified through pod logs or the Kubernetes API and deleted. The failed instance also needs to be isolated and terminated manually, either through the AWS Management Console, AWS Command Line Interface (AWS CLI), or tools like kubectl or eksctl. To restore cluster capacity, the user must increase the cluster size by modifying the Auto Scaling group or updating the instance group. After the new instance is provisioned, bootstrapped, and added to the cluster, the training job must be restarted manually. If checkpointing is enabled, the job can resume from the last saved state. The overall downtime depends on the time required to provision a new instance and restart the job by rescheduling the pods.
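For illustration only, these manual recovery steps might look roughly like the following sketch, using the Kubernetes Python client and boto3. The node name, namespace, and Auto Scaling group interaction are hypothetical placeholders; terminating the instance without decrementing desired capacity is just one way to get a replacement provisioned:

```python
# Hypothetical sketch of the manual recovery steps described above; names are placeholders.
import boto3
from kubernetes import client, config

FAILED_NODE = "ip-10-0-1-23.ec2.internal"  # node reported Unhealthy/NotReady (placeholder)
NAMESPACE = "training"                     # namespace of the training job (placeholder)

config.load_kube_config()
v1 = client.CoreV1Api()

# 1. Delete the training pods stuck on the failed node so they can be rescheduled later.
pods = v1.list_namespaced_pod(NAMESPACE, field_selector=f"spec.nodeName={FAILED_NODE}")
for pod in pods.items:
    v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)

# 2. Cordon the failed node so no new pods land on it.
v1.patch_node(FAILED_NODE, {"spec": {"unschedulable": True}})

# 3. Terminate the backing EC2 instance through the Auto Scaling group; keeping the
#    desired capacity unchanged asks the group to provision a replacement.
ec2 = boto3.client("ec2")
instance_id = ec2.describe_instances(
    Filters=[{"Name": "private-dns-name", "Values": [FAILED_NODE]}]
)["Reservations"][0]["Instances"][0]["InstanceId"]
boto3.client("autoscaling").terminate_instance_in_auto_scaling_group(
    InstanceId=instance_id, ShouldDecrementDesiredCapacity=False
)

# 4. Once the replacement node joins the cluster, resubmit the training job so it
#    resumes from the latest checkpoint.
```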
Faster failure detection (shorter MTTD), shorter replacement times (shorter MTTReplace), and quicker resumption all contribute to reducing total training time. Automating these processes with minimal user intervention is a key advantage of Amazon SageMaker HyperPod.
Amazon SageMaker HyperPod resilient training infrastructure
SageMaker HyperPod is a compute environment optimized for large-scale frontier model training. It lets users build resilient clusters for machine learning (ML) workloads and develop or fine-tune state-of-the-art frontier models, as demonstrated by organizations such as Luma Labs and Perplexity AI. SageMaker HyperPod runs health-monitoring agents in the background for each instance. When it detects a hardware failure, SageMaker HyperPod automatically repairs or replaces the faulty instance and resumes training from the last saved checkpoint. This automation removes the need for manual intervention, which means customers can train in distributed settings for weeks or months with minimal disruption. The benefits are particularly significant for customers deploying many instances (more than 16) in a cluster.
Frontier model builders can further enhance model performance using the built-in ML tools within SageMaker HyperPod. They can use Amazon SageMaker AI with MLflow to create, manage, and track ML experiments, or use Amazon SageMaker AI with TensorBoard to visualize model architecture and address convergence issues. Additionally, integrating with observability tools such as Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana provides deeper insights into cluster performance, health, and utilization, ultimately saving valuable development time. The following figure compares the downtime of an infrastructure system with and without SageMaker HyperPod.

Figure 3: Comparing the downtime chart from Figure 1 with downtime on SageMaker HyperPod. When a failure occurs, it's detected automatically by HyperPod agents and the instance is replaced in the background. Training is also resumed from the latest checkpoint.
SageMaker HyperPod reduces the downtime per hardware failure by automatically detecting hardware issues. When an issue is detected, SageMaker HyperPod automatically replaces the faulty node(s) and resumes your training job from the latest checkpoint, assuming that checkpoints are written.
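Automatic resumption relies on the training script writing checkpoints periodically and loading the most recent one at startup. The following is a minimal, framework-agnostic PyTorch sketch of that pattern; the checkpoint directory, save frequency, and model/optimizer are illustrative placeholders rather than HyperPod-specific APIs:

```python
# Minimal checkpoint write/resume pattern that automatic recovery depends on.
import glob
import os
import torch

CKPT_DIR = "/fsx/checkpoints"   # shared storage visible to all instances (placeholder path)
SAVE_EVERY = 500                # save frequency in training steps (placeholder)

def save_checkpoint(step, model, optimizer):
    path = os.path.join(CKPT_DIR, f"step_{step:08d}.pt")
    torch.save({"step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict()}, path)

def load_latest_checkpoint(model, optimizer):
    ckpts = sorted(glob.glob(os.path.join(CKPT_DIR, "step_*.pt")))
    if not ckpts:
        return 0  # no checkpoint yet: start from step 0
    state = torch.load(ckpts[-1], map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"] + 1

# In the training loop:
#   start_step = load_latest_checkpoint(model, optimizer)
#   for step in range(start_step, total_steps):
#       ...training step...
#       if step % SAVE_EVERY == 0:
#           save_checkpoint(step, model, optimizer)
```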
To evaluate SageMaker HyperPod's recovery times, we conducted experiments on clusters of p5.48xlarge instances at different sizes. The results in the following table are empirical measurements of time to resume by cluster size, reported at the 90th percentile (P90), meaning 90% of observed recoveries completed within the stated time.

| Cluster size (number of instances) | P90 time to detect (seconds) | P90 time to replace (seconds) | P90 time to resume (seconds) | Total downtime per failure (seconds) | Total downtime per failure (minutes) |
|---|---|---|---|---|---|
| 16 | 83 | 912 | 1212 | 2207 | 36.8 |
| 64 | 90 | 963 | 1320 | 2373 | 39.6 |
| 256 | 89 | 903 | 1398 | 2390 | 39.8 |
| 1024 | 80 | 981 | 1440 | 2501 | 41.7 |

Table 2: P90 times to detect, replace, and resume per failure (in seconds) on clusters of different sizes
As shown, the mean time to replace an instance is independent of cluster size. For a cluster of 256 p5.48xlarge instances training the Meta Llama 3.1 70B model with batch size = 8, replacing an instance takes about 940 seconds (or 15.7 minutes). After replacement, the new instance must install additional packages using lifecycle scripts and pass deep health checks before loading the latest saved checkpoint. When it's operational, the training job resumes from the latest checkpoint, minimizing progress loss despite the interruption. For a 256-instance cluster, it took us about 2,390 seconds (about 40 minutes) to automatically resume the training job after each failure.
Without SageMaker HyperPod, when a GPU failure occurs during a training job, the time it takes to resume training can vary widely depending on the infrastructure and processes in place. With proper checkpointing, automated job orchestration, and efficient hardware provisioning, the resume time can be reduced. Without these optimizations, however, the impact can be much more severe. Empirical evidence from customer experiences, including a leading open source frontier model provider, a top large language model (LLM) startup, an AI company specializing in enterprise frontier models, and a cutting-edge scientific research institute, indicates that without SageMaker HyperPod, the total downtime per GPU failure averages roughly 280 minutes. Amazon SageMaker HyperPod thus saves about 240 minutes (about 4 hours) of downtime per failure:
|  | Without SageMaker HyperPod (minutes) | With SageMaker HyperPod (minutes) |
|---|---|---|
| Mean time to root-cause | 10 | 1.5 |
| Mean time to replace | 240 | 15 |
| Mean time to resume | 30 | 23.5 |
| Total downtime per failure | 280 | 40 |

Table 3: Typical failure numbers, in minutes, with and without SageMaker HyperPod (as described in the section "What happens after a failure?")
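The totals in Table 3 are simply the sum of the three phases described in the previous section; a quick check reproduces the roughly 4 hours saved per failure:

```python
# Downtime per failure = time to root-cause + time to replace + time to resume (minutes)
without_hyperpod = 10 + 240 + 30    # 280 minutes
with_hyperpod = 1.5 + 15 + 23.5     # 40 minutes
print(without_hyperpod - with_hyperpod)         # 240 minutes saved per failure
print((without_hyperpod - with_hyperpod) / 60)  # 4.0 hours
```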
Quantifying the downtime savings
Depending on the frequency of failures, we can calculate the time to train and the cost savings of using SageMaker HyperPod. To illustrate this calculation, we assume it takes 40 minutes to replace an instance with SageMaker HyperPod compared to 280 minutes without it (as previously explained). Additionally, for this calculation, let's assume a training job requiring 10 million GPU hours on H100 instances, running on a 256-instance P5 cluster.
Although the actual overhead (in hours) depends on the size of the training job, the relative overhead remains constant. The benefits of SageMaker HyperPod in reducing total training time are shown in the following table. For example, in a 256-instance cluster with a failure rate of 0.05%, SageMaker HyperPod reduces total training time by 32%.
| Failure rate (per instance per hour) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| 0.01% | 0% | 0% | 1% | 1% | 2% | 5% | 9% | 17% |
| 0.02% | 0% | 1% | 1% | 2% | 5% | 9% | 17% | 28% |
| 0.05% | 1% | 2% | 3% | 6% | 11% | 20% | 32% | 48% |
| 0.07% | 1% | 2% | 4% | 8% | 15% | 25% | 40% | 55% |

Table 4: Total % of training time reduced by SageMaker HyperPod compared to a P5 cluster of the same size (columns show cluster size in instances)
To translate this into actual savings: for a training job requiring 10 million GPU hours on a 256-instance cluster, SageMaker HyperPod saves 104 days of training time. As a result, customers can reduce time-to-market by 3.5 months. Without SageMaker HyperPod, the total time to train would be roughly 325 days, 121 of which are spent just isolating and mitigating hardware issues. The following table shows the time to train benefits.
| Metric | Value |
|---|---|
| H100 GPU hours for training | 10,000,000 |
| Number of instances | 256 |
| Failure rate (per instance per hour) | 0.05% |
| Additional time to fix per failure (hours) | 4 |
| Days lost due to hardware issues (with SageMaker HyperPod) | 17 |
| Days lost due to hardware issues (without SageMaker HyperPod) | 121 |
| Time to train with SageMaker HyperPod (days) | 221 |
| Time to train without SageMaker HyperPod (days) | 325 |
| SageMaker HyperPod improvement | 32% |
| Time saved with SageMaker HyperPod (days) | 104 |

Table 5: Benefits brought by SageMaker HyperPod for a training run requiring 10 million GPU hours on a 256-instance cluster. SageMaker HyperPod saves 104 days of training time overall, resulting in a faster time to market (by 3.5 months!)
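The Table 5 figures can be reproduced from the stated assumptions. The sketch below counts the expected failures over the ideal (failure-free) compute time, an approximation that matches the published numbers to within rounding:

```python
gpu_hours = 10_000_000
instances, gpus_per_instance = 256, 8
failure_rate = 0.0005                                 # 0.05% per instance per hour
downtime_with, downtime_without = 40 / 60, 280 / 60   # hours of downtime per failure

ideal_days = gpu_hours / (instances * gpus_per_instance) / 24   # ~203.5 days of pure compute
failures = instances * failure_rate * ideal_days * 24           # ~625 expected failures

lost_with = failures * downtime_with / 24           # ~17 days lost with SageMaker HyperPod
lost_without = failures * downtime_without / 24     # ~121 days lost without
total_with = ideal_days + lost_with                 # ~221 days
total_without = ideal_days + lost_without           # ~325 days
print(round(total_without - total_with))                      # ~104 days saved
print(f"{(total_without - total_with) / total_without:.0%}")  # ~32% reduction
```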
For the same example, we can estimate the total cost savings using:

Number of failures = (Number of instances) × (Failure rate per instance per hour) × (24 hours per day) × (Total training days)

Days lost due to hardware issues = (Number of failures) × (Downtime per failure in hours) ÷ (24 hours per day)

The following table shows the cost to train benefits.
| Metric | Value |
|---|---|
| H100 GPU hours for training | 10,000,000 |
| Number of instances | 256 |
| Failure rate (per instance per hour) | 0.05% |
| Time saved with SageMaker HyperPod (days) | 104 |
| Cost per GPU per hour | $5 |
| Total cost savings with SageMaker HyperPod | $25,559,040 |

Table 6: Using the calculation described above, the cost to train benefits for a training run requiring 10 million GPU hours, 256 GPU-based instances, and an assumed failure rate of 0.05% per instance per hour
A training job requiring 10 million GPU hours with 104 additional days spent resolving hardware issues results in significant idle cluster time. Assuming a GPU cost of $5 per hour (equivalent to the price of P5 instances on Capacity Blocks for ML), the total cost savings with SageMaker HyperPod amount to $25,559,040.
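The dollar figure follows directly from the days saved: 104 days of avoided idle time across 2,048 GPUs at $5 per GPU-hour:

```python
days_saved = 104
total_gpus = 256 * 8      # 2,048 H100 GPUs
cost_per_gpu_hour = 5     # USD, the P5 Capacity Blocks price assumed above
savings = days_saved * 24 * total_gpus * cost_per_gpu_hour
print(f"${savings:,}")    # $25,559,040
```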
Summary
Training frontier models is a complex, resource-intensive process that is particularly vulnerable to hardware failures. In this post, we explored the instance failure rate, which can range from about 0.02% to 0.07% per instance-hour during large-scale distributed training. As cluster size grows, the likelihood of failures increases and the MTBF decreases. We also examined what happens after a failure, including root cause analysis, hardware repair or replacement, and system recovery and resumption.
Next, we examined Amazon SageMaker HyperPod, a purpose-built, fully resilient cluster for frontier model training. By incorporating robust fault-tolerance mechanisms and automated health monitoring, SageMaker HyperPod minimizes disruptions caused by hardware issues. This not only streamlines the training process but also enhances the reliability and efficiency of model development, enabling faster and more effective innovation delivery. The benefits are measurable and scale with both cluster size and failure rate. For a 256-instance cluster with a 0.05% per-instance-per-hour failure rate, SageMaker HyperPod reduces total training time by 32%, resulting in approximately $25.6 million in training cost savings.
By addressing the reliability challenges of frontier model training, SageMaker HyperPod lets ML teams focus on model innovation rather than infrastructure management. Organizations can now conduct long training runs with confidence, knowing that hardware failures will be automatically detected and resolved with minimal disruption to their ML workloads. Get started with Amazon SageMaker HyperPod.
Special thanks to Roy Allela, Senior AI/ML Specialist Solutions Architect, for his support on the launch of this post.
About the Authors
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and to lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
Trevor Harvey is a Principal Specialist in generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.