arxiv:2402.05099

Hydragen: High-Throughput LLM Inference with Shared Prefixes

Published on Feb 7, 2024
· Submitted by akhaliq on Feb 8, 2024

Abstract

Transformer-based large language models (LLMs) are now deployed to hundreds of millions of users. LLM inference is commonly performed on batches of sequences that share a prefix, such as few-shot examples or a chatbot system prompt. Decoding in this large-batch setting can be bottlenecked by the attention operation, which reads large key-value (KV) caches from memory and computes inefficient matrix-vector products for every sequence in the batch. In this work, we introduce Hydragen, a hardware-aware exact implementation of attention with shared prefixes. Hydragen computes attention over the shared prefix and unique suffixes separately. This decomposition enables efficient prefix attention by batching queries together across sequences, reducing redundant memory reads and enabling the use of hardware-friendly matrix multiplications. Our method can improve end-to-end LLM throughput by up to 32x against competitive baselines, with speedup growing with the batch size and shared prefix length. Hydragen also enables the use of very long shared contexts: with a high batch size, increasing the prefix length from 1K to 16K tokens decreases Hydragen throughput by less than 15%, while the throughput of baselines drops by over 90%. Hydragen generalizes beyond simple prefix-suffix decomposition and can be applied to tree-based prompt sharing patterns, allowing us to further reduce inference time on competitive programming problems by 55%.
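The prefix/suffix decomposition described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the single-head layout, and the absence of masking are assumptions made for illustration. It computes attention over the shared prefix once for all queries in the batch (a single matrix-matrix product), computes attention over each sequence's own suffix separately, and merges the two results using the log-sum-exp of each part's scores, which recovers attention over the concatenated keys and values.

```python
import numpy as np

def attention_with_lse(q, k, v):
    """Softmax attention that also returns the log-sum-exp of the scores.

    q: (..., nq, d), k: (..., nk, d), v: (..., nk, d). No masking (assumption).
    """
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    m = scores.max(axis=-1, keepdims=True)           # for numerical stability
    p = np.exp(scores - m)
    denom = p.sum(axis=-1, keepdims=True)
    out = (p / denom) @ v
    lse = m + np.log(denom)                          # log of the softmax denominator
    return out, lse

def combine(out1, lse1, out2, lse2):
    """Merge two partial attention results into attention over both key sets."""
    m = np.maximum(lse1, lse2)
    w1 = np.exp(lse1 - m)
    w2 = np.exp(lse2 - m)
    return (w1 * out1 + w2 * out2) / (w1 + w2)

def shared_prefix_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
    """q: (b, nq, d); prefix K/V: (n_p, d), stored once; suffix K/V: (b, n_s, d)."""
    b, nq, d = q.shape
    # Stack the queries of all sequences so the prefix is read once and the
    # prefix scores come from one matrix-matrix product.
    q_flat = q.reshape(b * nq, d)
    out_p, lse_p = attention_with_lse(q_flat, k_prefix, v_prefix)
    out_p = out_p.reshape(b, nq, d)
    lse_p = lse_p.reshape(b, nq, 1)
    # Suffixes differ per sequence, so this part stays batched per sequence.
    out_s, lse_s = attention_with_lse(q, k_suffix, v_suffix)
    return combine(out_p, lse_p, out_s, lse_s)

def reference_attention(q, k_prefix, v_prefix, k_suffix, v_suffix):
    """Baseline: replicate the prefix for every sequence and attend over everything."""
    b = q.shape[0]
    k = np.concatenate([np.broadcast_to(k_prefix, (b,) + k_prefix.shape), k_suffix], axis=1)
    v = np.concatenate([np.broadcast_to(v_prefix, (b,) + v_prefix.shape), v_suffix], axis=1)
    out, _ = attention_with_lse(q, k, v)
    return out
```

With random inputs, `shared_prefix_attention` and `reference_attention` should agree to floating-point tolerance, which is what "exact" means here: the decomposition changes how the computation is organized, not its result.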

Community

Wait. Wait. Wait. Forget faster batches. Doesn't it lead to literal infinite context length?
During inference, we have a KV cache, which can be stored on CPU/disk/parallel-universe, and we have Q=1 (duh, autoregression).
Then, using the formula, we can pretty much calculate SDPA in parts, never taking more from the KV cache than we can chew.
Unless I borked the napkin math, it checks out, and it's trivial to calculate SDPA(Q=1, K1||K2||K3) without needing to store K1||K2||K3 on the GPU.
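
The napkin math in this comment can be checked with a short NumPy sketch. It is an illustration under stated assumptions, not code from the paper: `chunked_attention` is a hypothetical name, there is a single query and a single head, and each chunk is assumed to be non-empty. It streams over (K, V) chunks one at a time and keeps only a running maximum, a running softmax denominator, and a running weighted sum of values.

```python
import numpy as np

def chunked_attention(q, kv_chunks):
    """Attention for one query over K1||K2||..., visiting one chunk at a time.

    q: (d,); kv_chunks: iterable of (k, v) pairs with shapes (n_i, d) and (n_i, d_v).
    Only the current chunk has to be resident in memory (assumption: n_i > 0).
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    m = -np.inf      # running maximum of the scores seen so far
    l = 0.0          # running softmax denominator, relative to m
    acc = None       # running sum of exp(score - m) * v
    for k, v in kv_chunks:
        s = (k @ q) * scale                    # scores for this chunk, shape (n_i,)
        m_new = max(m, s.max())
        p = np.exp(s - m_new)
        correction = np.exp(m - m_new)         # rescales earlier chunks; 0 on the first one
        l = l * correction + p.sum()
        chunk_acc = p @ v
        acc = chunk_acc if acc is None else acc * correction + chunk_acc
        m = m_new
    return acc / l

def full_attention(q, k, v):
    """Reference: softmax attention with all keys and values in memory at once."""
    s = (k @ q) / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max())
    return (p / p.sum()) @ v
```

The result matches attention over the concatenated cache, so the memory argument holds in this sketch. Compute and transfer time still grow linearly with the total context, since every chunk has to be read for every generated token.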

Yeah, I haven't seen this decomposition before, and it looks like incorporating the V directly into the decomposition lets them avoid materializing even the (d x n) attention scores.

It sounds like for the batched prefix prefill to work efficiently, your requests all have to come in at ~the same time, right? E.g. if you set a batch size of 16, then to get the faster batch inference you need all 16 requests to be ready to go (or you need to wait for enough requests to fill the batch).
