arxiv:2402.15220

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition

Published on Feb 23, 2024
· Submitted by akhaliq on Feb 26, 2024

Abstract

Self-attention is an essential component of large language models (LLMs) but a significant source of inference latency for long sequences. In multi-tenant LLM serving scenarios, the compute and memory operation cost of self-attention can be optimized by using the probability that multiple LLM requests have shared system prompts in prefixes. In this paper, we introduce ChunkAttention, a prefix-aware self-attention module that can detect matching prompt prefixes across multiple requests and share their key/value tensors in memory at runtime to improve the memory utilization of the KV cache. This is achieved by breaking monolithic key/value tensors into smaller chunks and structuring them into an auxiliary prefix tree. Consequently, on top of the prefix-tree-based KV cache, we design an efficient self-attention kernel, where a two-phase partition algorithm is implemented to improve the data locality during self-attention computation in the presence of shared system prompts. Experiments show that ChunkAttention can speed up the self-attention kernel by 3.2-4.8× compared to the state-of-the-art implementation, with the length of the system prompt ranging from 1024 to 4096.
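
To make the chunked prefix tree described in the abstract more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the names `ChunkNode`, `PrefixTreeKVCache`, `insert`, and the chunk size of 4 are hypothetical, and the `kv` field is a string placeholder standing in for real key/value tensors. The sketch assumes that only full chunks are shareable and that a trailing partial chunk stays private to its sequence.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # hypothetical chunk size (tokens per chunk), chosen for illustration


@dataclass
class ChunkNode:
    tokens: tuple                                  # token ids covered by this chunk
    kv: object = None                              # placeholder for the chunk's key/value tensors
    children: dict = field(default_factory=dict)   # full-chunk token tuple -> ChunkNode
    ref_count: int = 0                             # number of sequences using this chunk


class PrefixTreeKVCache:
    """Toy prefix tree whose nodes are fixed-size chunks of a sequence."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.root = ChunkNode(tokens=())
        self.allocated = 0  # number of chunks actually materialized

    def insert(self, token_ids):
        """Register a sequence; return (its chunk path, number of reused tokens)."""
        node, path, reused = self.root, [], 0
        for start in range(0, len(token_ids), self.chunk_size):
            key = tuple(token_ids[start:start + self.chunk_size])
            is_full = len(key) == self.chunk_size
            child = node.children.get(key) if is_full else None
            if child is not None:
                reused += len(key)            # matching prefix chunk: share it
            else:
                child = ChunkNode(tokens=key, kv=f"kv{self.allocated}")
                self.allocated += 1
                if is_full:
                    node.children[key] = child  # only full chunks become shareable
            child.ref_count += 1
            path.append(child)
            node = child
        return path, reused


if __name__ == "__main__":
    system_prompt = list(range(8))            # two full chunks shared by both requests
    cache = PrefixTreeKVCache()
    path_a, reused_a = cache.insert(system_prompt + [100, 101, 102])
    path_b, reused_b = cache.insert(system_prompt + [200])
    print(reused_a, reused_b, cache.allocated)
```

In this toy run the second request reuses the two chunks that hold the shared system prompt, so four chunks are allocated instead of the six that two independent caches would need. The two-phase partition kernel from the abstract is not modeled here.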

Community

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Glad to see ChunkAttn is featured. This is a revised version of the ICLR 2024 submission. The reviewers raised a few concerns: 1) proof of shared prefix lengths; 2) more experiments. We addressed them and released the current version. Feel free to provide any feedback.
https://openreview.net/forum?id=9k27IITeAZ
The code still needs to pass the open-source release process. It will be available on GitHub soon.

Is the core idea here:

  1. There are many scenarios where you might end up sharing (potentially very) long prefixes that you need to prefill (e.g. system prompt, exemplars, maybe even an instruction manual with some of the 100K - 1M models).
  2. Under this regime, it's helpful to cache common shared prefixes as KV cache chunks so they don't have to be recomputed during expensive prefill in, e.g., disaggregated/batched serving regimes
  3. This paper proposes an approach to identify and store these common prefixes

That seems like a super useful thing. Especially if these chunks could be loaded offline (e.g. downloading a static precomputed dictionary of KV cache chunks for the most common shared chunks or your entire giant system prompt)
