Uncovering the Groundbreaking Advantages of Multi-Step Forecasting in LLMs for Enterprise Advancements

The ability to use more complex LLMs at a fraction of the current cost and time opens up new possibilities in AI applications and services. These innovations in LLMs mark a pivotal moment for enterprises, as the substantial improvements in efficiency and cost reduction, powered by advanced GPU hardware, lay the foundation for the next generation of end-to-end enterprise AI solutions.
By: AI team
5 min read

Date: November 28, 2023

Optimizing LLMs: Leveraging GPU Power and Memory Management

With the emergence of large language models (LLMs), many cloud companies are racing to provide these applications as hosted services. However, running these applications is very expensive, requiring a significant number of hardware accelerators such as GPUs. Recent estimates suggest that processing an LLM request can be up to 10 times more expensive than a traditional keyword query. Given these high costs, increasing the throughput—and thereby reducing the cost per request—of LLM serving systems is becoming increasingly important.

At the core of LLMs lies a sequential, token-by-token generation process that makes inference memory-bound, underutilizing the computational power of GPUs and limiting serving throughput. For Transformers, the key and value tensors produced by the attention mechanism can be stored in fixed-size blocks of non-contiguous, paged memory, which achieves near-zero waste in KV cache memory. Evaluations of this block-level memory management, combined with preemptive request scheduling, show a 2-4x improvement in LLM serving throughput over state-of-the-art systems without affecting model accuracy, and the gains grow with longer sequences, larger models, and more complex decoding algorithms. This paged attention approach has been shown to substantially outperform earlier serving systems such as FasterTransformer and Orca.
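
To make the block-level idea concrete, here is a minimal Python sketch of the bookkeeping involved, assuming a simplified pool of fixed-size KV blocks and a per-request block table. The class names, block size, and eviction-free pool are illustrative stand-ins, not the production implementation.

```python
# Illustrative sketch of block-level KV cache management (paged-attention style).
# Block size, pool layout, and names are simplified for clarity.

BLOCK_SIZE = 16  # tokens stored per KV block (illustrative value)

class KVBlockPool:
    """A fixed pool of KV cache blocks that requests borrow and return."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; preempt or queue the request")
        return self.free_blocks.pop()

    def release(self, blocks: list[int]) -> None:
        self.free_blocks.extend(blocks)

class RequestKVCache:
    """Maps a request's logical token positions to non-contiguous physical blocks."""
    def __init__(self, pool: KVBlockPool):
        self.pool = pool
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # Allocate a new physical block only when the last one is full, so at
        # most BLOCK_SIZE - 1 slots are ever wasted per request.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def free(self) -> None:
        self.pool.release(self.block_table)
        self.block_table.clear()
        self.num_tokens = 0

pool = KVBlockPool(num_blocks=1024)
req = RequestKVCache(pool)
for _ in range(40):          # decode 40 tokens for one request
    req.append_token()
print(len(req.block_table))  # ceil(40 / 16) = 3 physical blocks in use
req.free()                   # blocks return to the pool for other requests
```

Because each request holds only whole blocks that are returned the moment it finishes or is preempted, memory fragmentation stays bounded by the block size rather than by the longest possible sequence.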

We have built on this with speculative decoding. Speculative decoding algorithms use a second, smaller draft model to propose tokens that the main model then verifies, speeding up generation without degrading output quality. Parallel decoding was particularly intriguing to us at Chima because it goes a step further: it speeds up inference by proposing and verifying multiple disjoint n-grams in parallel, eliminating the need for a draft model altogether. It also lays the groundwork for training LLM agents with multi-step parallel policies using reinforcement learning: instead of generating tokens purely in a supervised manner, the LLM can generate “actions” drawn from any token in its vocabulary, which we see as the way forward for LLMs.
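
As a rough illustration of the draft-free idea, the sketch below shows the accept-or-fall-back rule that parallel decoding relies on: guessed tokens are kept only while they match what greedy decoding would have produced. The `verify_candidates` helper and the toy model are hypothetical stand-ins, and a real implementation scores all guessed positions in one batched forward pass rather than looping.

```python
# Sketch of draft-free parallel decoding: propose candidate tokens, verify them
# against the base model, and accept the longest prefix that matches greedy
# decoding. The model interface and proposal tokens are illustrative only.
from typing import Callable

def verify_candidates(
    greedy_next: Callable[[list[int]], int],  # argmax token for a given prefix
    prefix: list[int],
    guessed: list[int],
) -> list[int]:
    """Accept guessed tokens one by one while they agree with greedy decoding."""
    accepted: list[int] = []
    context = list(prefix)
    for token in guessed:
        # A real system checks every guessed position in a single forward pass;
        # the loop here is only for readability.
        if greedy_next(context) == token:
            accepted.append(token)
            context.append(token)
        else:
            break
    # Always emit at least one token so decoding makes progress.
    if not accepted:
        accepted.append(greedy_next(context))
    return accepted

# Toy "model": always predicts (last token + 1) modulo 100.
toy_greedy = lambda ctx: (ctx[-1] + 1) % 100

print(verify_candidates(toy_greedy, prefix=[5], guessed=[6, 7, 9]))  # -> [6, 7]
```

The acceptance rule is what preserves output quality: the final sequence is identical to what plain greedy decoding would have produced, only reached in fewer sequential steps.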

Demonstrating the Efficiency of Combined Decoding Techniques

To test this, we compared inference times for “facebook/opt-125m” under four configurations: naive Hugging Face inference, parallel decoding alone, KV caching alone, and KV caching combined with parallel decoding. The combined KV-cache and parallel-decoding approach reduces latency nearly 6x (5.76x) relative to naive Hugging Face inference, and delivers roughly 10% higher throughput than KV-cache inference alone.
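
For reference, a stripped-down version of such a timing harness might look like the sketch below. It covers only the two settings that plain Hugging Face `transformers` exposes directly, decoding with and without the KV cache; the parallel-decoding variants depend on our own implementation and are not reproduced here, and the prompt and token counts are arbitrary.

```python
# Timing sketch for facebook/opt-125m: greedy generation with and without the
# KV cache, as a baseline for the four-way comparison described above.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

inputs = tokenizer("Enterprise AI will", return_tensors="pt")

def time_generate(use_cache: bool, max_new_tokens: int = 128) -> float:
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,        # greedy decoding for a fair comparison
            use_cache=use_cache,    # toggle reuse of cached key/value tensors
        )
    return time.perf_counter() - start

naive = time_generate(use_cache=False)   # recomputes attention over the full prefix
cached = time_generate(use_cache=True)   # reuses cached key/value tensors
print(f"no cache: {naive:.2f}s  kv cache: {cached:.2f}s  speedup: {naive / cached:.2f}x")
```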

A New Era in End-to-End Enterprise AI

This not only demonstrates the effectiveness of these techniques but also paves the way for training LLMs in the future. Our work on enterprise deployment of LLMs shows an almost 6x improvement over the naive method of inference exposed in Hugging Face today. In conclusion, inspired by recent breakthroughs in large language model inference, we demonstrate how large enterprises can significantly increase the throughput, and thereby reduce the cost per request, of LLM serving systems through a combination of these inference techniques. This advanced inference approach offers a sustainable, cost-effective solution for large-scale LLM deployment. These innovations mark a pivotal moment for enterprises: the substantial improvements in efficiency and cost reduction, powered by advanced GPU hardware, lay the foundation for the next generation of end-to-end enterprise AI solutions.

Let us build your generative AI securely and at scale.

Schedule AI briefing