Description:
Large language models (LLMs) are rapidly becoming the backbone of AI-driven applications. Without proper optimization, however, LLMs can be expensive to run, slow to serve, and prone to performance bottlenecks. As demand for real-time AI applications grows, this comprehensive guide tackles the complexities of deploying and optimizing LLMs at scale. Authors Chi Wang and Peiheng Hu take a real-world approach, backed by practical examples and code, assembling essential strategies for designing infrastructure that can meet the demands of modern AI applications. Learn the key principles for designing a model-serving system and how to build one that meets specific business requirements.