One thing to note is that custom Pallas kernels don't yet support scan. Here's a full example of using scan_layers in an LLM for reference.
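For a quick sense of the API, here's a minimal sketch of scan_layers usage (the layer stack and shapes are illustrative, not taken from the linked example):

```python
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm
from torch_xla.experimental.scan_layers import scan_layers

device = xm.xla_device()

# A stack of identically structured layers (plain Linear blocks for brevity;
# in an LLM these would be the decoder layers).
layers = nn.ModuleList([nn.Linear(1024, 1024) for _ in range(16)]).to(device)

x = torch.randn(4, 1024, device=device)

# scan_layers traces a single layer and reuses that compiled body across the
# whole stack, rather than unrolling all 16 layers into the graph.
y = scan_layers(layers, x)
```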
Host offloading
Another powerful tool for memory optimization in PyTorch/XLA is host offloading. This technique lets you temporarily move tensors from the TPU to the host CPU's memory, freeing up valuable device memory during training. This is especially helpful for large models where memory pressure is a concern. You can use torch_xla.experimental.stablehlo_custom_call.place_to_host to offload a tensor and torch_xla.experimental.stablehlo_custom_call.place_to_device to retrieve it later. A typical use case involves offloading intermediate activations during the forward pass and then bringing them back during the backward pass. Here's an example of host offloading for reference.
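As a minimal sketch (the tensor shape and surrounding training step are illustrative), the offload-and-retrieve pattern looks like this:

```python
import torch
import torch_xla.core.xla_model as xm
from torch_xla.experimental.stablehlo_custom_call import (
    place_to_host,
    place_to_device,
)

device = xm.xla_device()

# A large intermediate activation produced during the forward pass.
activation = torch.randn(8192, 8192, device=device)

# Annotate the tensor so its buffer is placed in host memory, freeing
# device memory for the rest of the step.
activation_host = place_to_host(activation)

# ... remaining device computation runs with the activation off-device ...

# Bring the tensor back to device memory when the backward pass needs it.
activation = place_to_device(activation_host)
```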
Strategic use of host offloading, such as when you're working with limited memory and are unable to use the accelerator continuously, may significantly improve your ability to train large and complex models within the memory constraints of your hardware.
Alternative base Docker image
Have you ever encountered a situation where your TPUs are sitting idle while your host CPU is heavily loaded tracing your model execution graph for just-in-time compilation? This means your model is "tracing bound," i.e. performance is limited by the speed of tracing operations.
The C++11 ABI image offers a solution. Starting with this release, PyTorch/XLA provides a choice of C++ ABI flavors for both Python wheels and Docker images, so you can pick which version of the C++ ABI you'd like to use with PyTorch/XLA. You'll now find builds with both the pre-C++11 ABI, which remains the default to match PyTorch upstream, and the more modern C++11 ABI.
Switching to the C++11 ABI wheels or Docker images can lead to noticeable improvements in the scenarios described above. For example, we observed a 20% relative improvement in goodput with the Mixtral 8x7B model on a v5p-256 Cloud TPU (with a global batch size of 1024) when we switched from the pre-C++11 ABI to the C++11 ABI. ML goodput gives us an understanding of how efficiently a given model uses the hardware, so a higher goodput measurement for the same model on the same hardware indicates better performance.
An example of using a C++11 ABI Docker image in your Dockerfile might look something like the following (the image tag shown is illustrative; check the PyTorch/XLA release notes for the current C++11 ABI tags):
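```dockerfile
# The exact registry path and tag are assumptions for illustration; use the
# cxx11-suffixed image published for your release and Python version.
FROM us-central1-docker.pkg.dev/tpu-pytorch-releases/docker/xla:r2.6.0_3.10_tpuvm_cxx11

# ... install your dependencies and add your training code as usual ...
```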