The pace of innovation in open-source AI is breathtaking, with models like Meta's Llama 4 and DeepSeek AI's DeepSeek. However, deploying and optimizing large, powerful models can be complex and resource-intensive. Developers and machine learning (ML) engineers need reproducible, verified recipes that articulate the steps for trying out the models on available accelerators.
Today, we're excited to announce enhanced support and new, optimized recipes for the latest Llama 4 and DeepSeek models, leveraging our cutting-edge AI Hypercomputer platform. AI Hypercomputer helps you build a strong AI infrastructure foundation using a set of purpose-built infrastructure components that are designed to work well together for AI workloads like training and inference. It's a systems-level approach that draws from our years of experience serving AI experiences to billions of users, and it combines purpose-built hardware, optimized software and frameworks, and flexible consumption models. Our AI Hypercomputer resources repository on GitHub, your hub for these recipes, continues to grow.
In this blog, we'll show you how to access Llama 4 and DeepSeek models today on AI Hypercomputer.
Added support for new Llama 4 models
Meta recently released the Scout and Maverick models in the Llama 4 herd of models. Llama 4 Scout is a 17-billion-active-parameter model with 16 experts, and Llama 4 Maverick is a 17-billion-active-parameter model with 128 experts. These models deliver innovations and optimizations based on a Mixture of Experts (MoE) architecture, and they support multimodal capability and long context lengths.
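To make the "active parameters" idea concrete, here's a minimal sketch of top-1 MoE routing in plain Python/NumPy. The layer sizes, router, and top-1 selection are toy assumptions for illustration only, not Llama 4's actual routing scheme or dimensions:

```python
import numpy as np

# Toy Mixture of Experts layer: a router picks one expert per token, so only
# that expert's weights are "active" for that token even though all experts
# live in the model. Sizes are illustrative, not Llama 4's real dimensions.
rng = np.random.default_rng(0)
d_model, num_experts = 64, 16

router_w = rng.normal(size=(d_model, num_experts))          # routing weights
experts = rng.normal(size=(num_experts, d_model, d_model))  # one FFN per expert

def moe_forward(x: np.ndarray) -> np.ndarray:
    """x: (tokens, d_model). Each token is processed by its top-1 expert."""
    logits = x @ router_w              # (tokens, num_experts) routing scores
    choice = logits.argmax(axis=-1)    # top-1 expert index per token
    out = np.empty_like(x)
    for e in range(num_experts):
        mask = choice == e
        if mask.any():
            out[mask] = x[mask] @ experts[e]  # only the chosen expert runs
    return out

tokens = rng.normal(size=(8, d_model))
print(moe_forward(tokens).shape)  # (8, 64)
```

Because each token touches only its routed expert, the per-token compute scales with the active parameter count rather than the full parameter count, which is what makes MoE models economical to serve.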
But serving these models can present challenges in terms of deployment and resource management. To help simplify this process, we're releasing new recipes for serving Llama 4 models on Google Cloud Trillium TPUs and A3 Mega and A3 Ultra GPUs.
- JetStream, Google's throughput- and memory-optimized engine for LLM inference on XLA devices, now supports Llama-4-Scout-17B-16E and Llama-4-Maverick-17B-128E inference on Trillium, the sixth-generation TPU. New recipes provide the steps to deploy these models using JetStream and MaxText on a Trillium TPU GKE cluster.
- vLLM is a high-throughput, memory-efficient inference and serving engine for LLMs. New recipes demonstrate how to use vLLM to serve the Llama 4 Scout and Maverick models on A3 Mega and A3 Ultra GPU GKE clusters (see the vLLM sketch after this list).
- For serving the Maverick model on TPUs, we use Pathways on Google Cloud. Pathways is a system that simplifies large-scale machine learning computations by enabling a single JAX client to orchestrate workloads across multiple large TPU slices. In the context of inference, Pathways enables multi-host serving across multiple TPU slices, and it is used internally at Google to train and serve large models like Gemini (the single-controller pattern is sketched after this list).
- MaxText provides high-performance, highly scalable, open-source reference LLM implementations for OSS models, written in pure Python/JAX and targeting Google Cloud TPUs and GPUs for training and inference. MaxText now includes reference implementations for the Llama 4 Scout and Maverick models, along with information on how to perform checkpoint conversion, training, and decoding for Llama 4 (an example invocation follows this list).
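To give a flavor of the vLLM path, here's a minimal offline-inference sketch using vLLM's Python API. The model ID, GPU count, and context length below are assumptions for illustration; the recipes cover the full, supported GKE deployment:

```python
# Minimal vLLM sketch for Llama 4 Scout. Assumes vLLM is installed, the
# Hugging Face model ID below is correct, and 8 GPUs are available on the
# host -- all assumptions; follow the A3 recipes for the supported path.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # assumed model ID
    tensor_parallel_size=8,  # shard the model across 8 GPUs on one A3 host
    max_model_len=8192,      # cap the context length to bound memory use
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain Mixture of Experts in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

The Pathways item above centers on a single JAX client orchestrating many devices. The sketch below illustrates that single-controller pattern with standard JAX sharding over whatever devices are visible locally; it does not use Pathways itself, which the TPU recipes configure for you:

```python
# Illustrative single-controller pattern in JAX: one Python client builds a
# device mesh, shards the data, and jit-compiles one program that runs
# across every device. Pathways extends this same model across multiple TPU
# slices; this toy version only uses locally visible devices.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

n_dev = jax.device_count()
mesh = Mesh(np.array(jax.devices()), axis_names=("data",))
sharding = NamedSharding(mesh, P("data"))  # split the batch axis over devices

x = jax.device_put(jnp.ones((n_dev * 4, 1024)), sharding)

@jax.jit
def step(x):
    # One jitted program; XLA inserts the cross-device communication.
    return jnp.tanh(x @ x.T)

print(step(x).shape)  # (n_dev * 4, n_dev * 4)
```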
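Finally, the MaxText recipes drive checkpoint conversion, training, and decoding through MaxText's command-line entry points. The sketch below shows the general shape of such an invocation from Python; the model name, bucket, and flag values are placeholders, so consult the recipe for the exact commands:

```python
# Hypothetical MaxText training invocation. The script path and key=value
# override style follow MaxText's usual CLI, but the model name, bucket,
# and flags here are placeholders -- see the Llama 4 recipe for real values.
import subprocess

subprocess.run(
    [
        "python3", "MaxText/train.py", "MaxText/configs/base.yml",
        "model_name=llama4-17b-16e",                      # assumed identifier
        "base_output_directory=gs://YOUR_BUCKET/llama4",  # placeholder bucket
        "run_name=llama4-scout-demo",
        "per_device_batch_size=1",
    ],
    check=True,
)
```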