“What isn’t measured can’t be improved.” This quote has become a guideline for teams training foundation models. When you’re dealing with complex, large-scale AI systems, things can spiral quickly without the right oversight. Operating at hyperscale poses significant challenges for teams, from the massive volume of data generated to the unpredictability of hardware failures and the need for efficient resource management. These issues require strategic solutions, which is why monitoring isn’t just a nice-to-have: it’s the backbone of transparency, reproducibility, and efficiency. During my talk at NeurIPS, I broke down five key lessons learned from teams facing large-scale model training and monitoring. Let’s get into it.
Real-time monitoring prevents costly failures
Imagine this: you’re training a large language model on thousands of GPUs at a cost of hundreds of thousands of dollars per day. Now imagine discovering, hours into training, that your model is diverging or that hardware issues are degrading your performance. The financial and operational implications are staggering. This is why live monitoring, the ability to act immediately, is so critical.
Live monitoring allows teams to see experiment progress as it happens, rather than waiting for checkpoints or the end of a run. This real-time visibility is a game-changer for identifying and fixing problems on the fly. In addition, automated processes allow you to set up monitoring workflows once and reuse them for similar experiments. This streamlines the process of comparing results, analyzing outcomes, and debugging issues, saving time and effort.
However, achieving true live monitoring is far from simple. Hyperscale training generates an overwhelming volume of data, often reaching up to a million data points per second. Traditional monitoring tools struggle under such loads, creating bottlenecks that can delay corrective action. Some teams try to cope by batching or sampling metrics, but these approaches sacrifice real-time visibility and add complexity to the code.
The solution lies in systems that can handle high-throughput data ingestion while providing accurate, real-time insights. Tools like neptune.ai make this possible by providing dashboards that visualize metrics without delaying training. For example, live tracking of GPU utilization or memory usage can reveal early signs of bottlenecks or out-of-memory errors, allowing engineers to proactively adjust course. Here are some testimonials:
One thing we’re always keeping track of is what the utilization is and how to improve it. Sometimes, we’ll get, for example, out-of-memory errors, and then seeing how the memory increases over time in the experiment is really helpful for debugging as well.
James Tu
Research Scientist, Waabi
For some of the pipelines, Neptune was helpful for us to see the utilization of the GPUs. The utilization graphs in the dashboard are a perfect proxy for finding some bottlenecks in the performance, especially if we are running many pipelines.
Wojtek Rosiński
CTO, ReSpo.Vision
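To make this concrete, here is a minimal sketch of the kind of live resource logging described above. It polls GPU memory through PyTorch at intervals inside the training loop and streams the values to the tracker; the tracker calls assume the neptune 1.x Python client, and the project name, field names, and training-loop helpers are placeholders.

```python
# Minimal sketch of live GPU-resource logging inside a training loop.
# Tracker calls follow the neptune 1.x client API (init_run / append);
# project name, field names, dataloader, and training_step are placeholders.
import torch
import neptune

run = neptune.init_run(project="workspace/project")  # placeholder project

def log_gpu_stats(run, step):
    # Memory occupied by tensors vs. memory reserved by the allocator;
    # a steady climb or a growing gap is an early sign of OOM trouble.
    allocated_gb = torch.cuda.memory_allocated() / 1e9
    reserved_gb = torch.cuda.memory_reserved() / 1e9
    run["system/gpu_mem_allocated_gb"].append(allocated_gb, step=step)
    run["system/gpu_mem_reserved_gb"].append(reserved_gb, step=step)

for step, batch in enumerate(dataloader):   # dataloader defined elsewhere
    loss = training_step(batch)             # training_step defined elsewhere
    run["train/loss"].append(loss, step=step)
    if step % 50 == 0:                      # throttle system metrics slightly
        log_gpu_stats(run, step)
```

Throttling the system metrics to every few steps keeps ingestion volume manageable without losing the early-warning signal.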

Troubleshooting hardware failures is hard: simplify it with automated debugging
Distributed systems are prone to failure, and hardware failures are notoriously difficult to troubleshoot. A single hardware failure can cascade into widespread outages, often with cryptic error messages. Teams often waste time sifting through stack traces, trying to distinguish between infrastructure problems and code bugs.
At Cruise, engineers used frameworks like Ray and Lightning to improve error reporting. By automatically labeling errors as either “infra” or “user” issues and correlating stack traces across nodes, debugging became much faster.
Igor Tsvetkov
Former Senior Staff Software Engineer, Cruise
AI teams automating error categorization and correlation can significantly reduce debugging time in hyperscale environments, just as Cruise has done. How? By using classification strategies to identify whether failures originated from hardware constraints (e.g., GPU memory leaks, network latency) or software bugs (e.g., faulty model architectures, misconfigured hyperparameters).
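To illustrate the idea (this is not Cruise’s implementation), here is a minimal sketch of such a categorization layer: it matches an exception’s stack trace against a few known infrastructure failure signatures and falls back to a “user” label, so retries and alerts can be routed differently. The patterns and the train() entry point are hypothetical.

```python
# Minimal sketch of automatic error categorization (hypothetical patterns,
# not Cruise's implementation): classify a failure as "infra" or "user"
# so retries and alerts can be routed differently.
import re
import traceback

INFRA_PATTERNS = [
    r"CUDA out of memory",          # GPU memory exhaustion
    r"NCCL (error|timeout)",        # collective-communication failures
    r"ECC error",                   # hardware memory faults
    r"Connection (reset|refused)",  # network flakiness between nodes
]

def categorize_failure(exc: BaseException) -> str:
    text = "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))
    if any(re.search(p, text, re.IGNORECASE) for p in INFRA_PATTERNS):
        return "infra"   # candidate for automatic retry on healthy hardware
    return "user"        # likely a code/config bug; surface it to the author

try:
    train()  # placeholder for the distributed training entry point
except Exception as e:
    label = categorize_failure(e)
    print(f"[{label}] {type(e).__name__}: {e}")
    raise
```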
Intuitive experiment tracking optimizes resource utilization
Another relevant aspect of hyperscale monitoring is optimizing resource utilization, especially in a setting where hardware failures and training interruptions can set teams back significantly. Picture a scenario where training jobs suddenly deviate: loss metrics spike, and you’re left deciding whether to let the job run or terminate it. Advanced experiment trackers allow for remote experiment termination, eliminating the need for teams to manually access cloud logs or servers.
Use checkpoints at frequent intervals so you don’t have to restart from scratch but can simply warm-start from the previous checkpoint. Most mature training frameworks already offer automated checkpointing and warm-starts from previous checkpoints. But most of these, by default, save the checkpoints on the same machine. This doesn’t help if your hardware crashes or, for example, you’re using spot instances and they are reassigned.
For maximum resilience and to prevent losing data if hardware crashes, checkpoints should be linked to your experiment tracker. This doesn’t mean that you upload GBs’ worth of checkpoints to the tracker (although you can, and some of our customers, especially self-hosted customers, do this for security reasons), but rather that you store pointers to the remote location, like S3, where the checkpoints have been saved. This allows you to link a checkpoint with the corresponding experiment step and efficiently retrieve the relevant checkpoint for any given step.
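A minimal sketch of this pattern, assuming checkpoints go to S3 and the tracker is the neptune 1.x Python client, might look like the following; the bucket name, paths, and field layout are placeholders.

```python
# Minimal sketch: periodic checkpointing to S3 with only a pointer logged
# to the experiment tracker. Bucket and paths are placeholders; the tracker
# calls assume the neptune 1.x client API.
import torch
import boto3

s3 = boto3.client("s3")
BUCKET = "my-training-checkpoints"  # placeholder bucket name

def checkpoint(run, model, optimizer, step, every=1000):
    if step % every != 0:
        return
    local_path = f"/tmp/ckpt_step_{step}.pt"
    torch.save(
        {"model": model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
        local_path,
    )
    key = f"run-{run['sys/id'].fetch()}/ckpt_step_{step}.pt"
    s3.upload_file(local_path, BUCKET, key)
    # Log only the pointer, keyed by step, so the right checkpoint can be
    # retrieved later for a warm start or a fork.
    run[f"checkpoints/step_{step}"] = f"s3://{BUCKET}/{key}"
```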

However, there are two caveats to successfully restarting an experiment from a checkpoint: assuming that the experimentation environment is constant, or at least reproducible, and addressing deterministic issues like out-of-memory (OOM) errors or bottlenecks that may require parameter changes to avoid repeating failures. This is where forking can play a significant role in improving recovery and progress.
Track months-long model training with more confidence. Use neptune.ai’s forking feature to iterate faster and optimize the usage of GPU resources.
With Neptune, users can visualize forked training out of the box (a short code sketch follows the list below). This means you can:
- Test multiple configs at the same time. Stop the runs that don’t improve accuracy. And continue from the most accurate last step.
- Restart failed training sessions from any previous step. The training history is inherited, and the entire experiment is visible on a single chart.
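For illustration, resuming a failed run as a fork might look like the sketch below. It uses the neptune_scale client’s fork_run_id and fork_step parameters as I understand them from Neptune’s documentation; treat the exact argument names (and any additionally required arguments) as assumptions and check the current docs.

```python
# Sketch of resuming a failed run as a fork with the neptune_scale client.
# Parameter names (fork_run_id, fork_step) follow Neptune's forking docs as
# I understand them; verify against the current neptune_scale API.
from neptune_scale import Run

LAST_GOOD_STEP = 12_000  # last healthy step before the crash or divergence

run = Run(
    run_id="pretrain-v2-fork-1",   # placeholder id for the new, forked run
    fork_run_id="pretrain-v2",     # parent run whose history is inherited
    fork_step=LAST_GOOD_STEP,      # step to branch from
)

# Continue training from the matching checkpoint; metrics logged from here on
# appear on the same chart as the parent's history up to LAST_GOOD_STEP.
for step in range(LAST_GOOD_STEP + 1, TOTAL_STEPS):  # TOTAL_STEPS defined elsewhere
    loss = training_step(step)                        # training_step defined elsewhere
    run.log_metrics(data={"train/loss": loss}, step=step)
```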
In addition, checkpointing strategies are crucial for optimizing recovery processes. Frequent checkpointing ensures minimal loss of progress, allowing you to warm-start from the most recent state instead of starting from scratch. However, checkpointing can be resource-intensive in terms of storage and time, so we need to strike a balance between frequency and overhead.
For large-scale models, the overhead of writing and reading weights to persistent storage can significantly reduce training efficiency. Innovations like redundant in-memory copies, as demonstrated by Google’s Gemini models, enable rapid recovery and improved training goodput (defined by Google as the time spent computing useful new steps over the elapsed time of the training job), increasing resilience and efficiency.
Solutions like PyTorch Distributed’s asynchronous checkpointing can significantly reduce checkpointing times, making frequent checkpointing more viable without compromising training performance.
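As a rough sketch, asynchronous checkpointing with torch.distributed.checkpoint can look like the following; it assumes a recent PyTorch release where async_save is available, and the exact API may differ between versions.

```python
# Rough sketch of asynchronous checkpointing with PyTorch Distributed.
# Assumes a recent PyTorch release where torch.distributed.checkpoint.async_save
# is available; signatures may differ between versions.
import torch.distributed.checkpoint as dcp

def save_checkpoint_async(model, optimizer, step, ckpt_dir):
    state_dict = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }
    # async_save stages the state dict and writes it in the background,
    # returning a future so the training loop can continue immediately.
    future = dcp.async_save(state_dict, checkpoint_id=f"{ckpt_dir}/step_{step}")
    return future  # call future.result() before the next save to confirm it finished
```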
Beyond models, checkpointing the state of dataloaders remains a challenge due to the distributed states across nodes. While some organizations like Meta have developed in-house solutions, general frameworks have yet to fully address this issue. Incorporating dataloader checkpointing can further enhance resilience by preserving the exact training state during recovery.
Reproducibility and transparency are non-negotiable
Reproducibility is the bedrock of reliable research, but it’s notoriously difficult at scale. Ensuring reproducibility requires consistent tracking of environment details, datasets, configurations, and results. This is where Neptune’s approach excels, linking every experiment’s lineage (from parent runs to dataset versions) in an accessible dashboard.
This transparency not only aids validation but also accelerates troubleshooting. Consider ReSpo.Vision’s challenges in managing and comparing results across pipelines. By implementing organized tracking systems, they gained visibility into pipeline dependencies and experiment parameters, streamlining their workflow.
A single source of truth simplifies data visualization and management at scale
Managing and visualizing data at scale is a common challenge, amplified in the context of large-scale experimentation. While tools like MLflow or TensorBoard are sufficient for smaller projects with 10–20 experiments, they quickly fall short when handling hundreds or even thousands of experiments. At this scale, organizing and comparing results becomes a logistical hurdle, and relying on tools that cannot effectively visualize or manage this scale leads to inefficiencies and missed insights.
A solution lies in adopting a single source of truth for all experiment metadata, encompassing everything from input data and training metrics to checkpoints and outputs. Neptune’s dashboards address this challenge by providing a highly customizable and centralized platform for experiment tracking. These dashboards enable real-time visualization of key metrics, which can be tailored to include “custom metrics”: metrics not explicitly logged at the code level but calculated retrospectively within the application. For instance, if a business requirement shifts from using precision and recall to the F1 score as a performance indicator, custom metrics allow you to calculate and visualize these metrics across existing and future experiments without rerunning them, ensuring flexibility and minimizing duplicated effort.
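To make the F1 example concrete, the derived metric is just a function of values that were already logged, which is why it can be computed after the fact over existing runs; here is a minimal sketch with made-up per-run values.

```python
# Minimal sketch: a derived metric (F1) computed from already-logged values,
# so existing experiments don't need to be rerun when the KPI changes.
def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. applied retrospectively to per-run summary values (placeholder numbers)
runs = {"run-101": (0.82, 0.74), "run-102": (0.78, 0.81)}
f1_by_run = {rid: f1_score(p, r) for rid, (p, r) in runs.items()}
```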
Consider the challenges faced by Waabi and ReSpo.Vision. Waabi’s teams, running large-scale ML experiments, needed a way to organize and share their experiment data efficiently. Similarly, ReSpo.Vision required an intuitive system to visualize multiple metrics in a standardized format that any team member, technical or non-technical, could easily access and interpret. Neptune’s dashboards provided the solution, allowing these teams to streamline their workflows by offering visibility into all relevant experiment data, reducing overhead, and enabling collaboration across stakeholders.
I like those dashboards because we need several metrics, so you code the dashboard once, have those styles, and just see it on one screen. Then, any other person can view the same thing, so that’s pretty good.
Łukasz Grad
Chief Data Scientist, ReSpo.Vision
The benefits of such an approach extend beyond visualization. Logging only essential data and calculating derived metrics within the application reduces latency and streamlines the experimental process. This capability empowers teams to focus on actionable insights, enabling scalable and efficient experiment tracking, even for projects involving tens of thousands of models and subproblems.
Visualizing large datasets
We typically don’t think of dataset visualization as part of experiment monitoring. However, preparing the dataset for model training is an experiment in itself, and while it may be an upstream experiment not in the same pipeline as the actual model training, data management and visualization are critical to LLMOps.
Large-scale experiments often involve processing billions of data points or embeddings. Visualizing such data to uncover relationships and debug issues is a common hurdle. Tools like Deepscatter and Jupyter Scatter have made progress in scaling visualizations for massive datasets, offering researchers valuable insights into their data distribution and embedding structures.
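As a small illustration, the sketch below projects a set of embeddings to 2D and hands them to Jupyter Scatter. It assumes the jscatter package with a plot(data=..., x=..., y=...) entry point and scikit-learn for the projection; the embeddings file is a placeholder, and the details may differ from your setup.

```python
# Small sketch: visualize a large set of embeddings with Jupyter Scatter.
# Assumes the jscatter (jupyter-scatter) package with a plot(data=..., x=..., y=...)
# entry point and scikit-learn for the 2D projection; adjust to your stack.
import numpy as np
import pandas as pd
import jscatter
from sklearn.decomposition import PCA

embeddings = np.load("embeddings.npy")            # placeholder: (n_points, dim) array
coords = PCA(n_components=2).fit_transform(embeddings)

df = pd.DataFrame(coords, columns=["x", "y"])
jscatter.plot(data=df, x="x", y="y")              # interactive scatter in a notebook
```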
Moving forward
The path to efficient hyperscale training lies in combining robust monitoring, advanced debugging tools, and comprehensive experiment tracking. Solutions like Neptune Scale are designed to address these challenges, offering the scalability, precision, and transparency researchers need.
How about being one of the first to access Neptune Scale?
Neptune Scale is our upcoming product release built for teams that train foundation models. It offers enhanced scalability and exciting new features. You can join our beta program to benefit from Neptune Scale earlier.
If you’re interested in learning more, visit our blog or join the MLOps community to explore case studies and actionable strategies for large-scale AI experimentation.
Acknowledgments
I would like to express my gratitude to Prince Canuma, Dr. Shantipriya Parida, and Igor Tsvetkov for their valuable time and insightful discussions on this topic. Their contributions and perspectives were instrumental in shaping this talk.