arxiv:2408.14906

Writing in the Margins: Better Inference Pattern for Long Context Retrieval

Published on Aug 27

· Submitted by

kiranr on Aug 28

#1 Paper of the day

Upvote

138

Authors:

Melisa Russak ,

Umar Jamil ,

Kiran Kamble ,

Axel Magnuson ,

Mateusz Russak ,

Waseem AlShikh

Abstract

In this paper, we introduce Writing in the Margins (WiM), a new inference pattern for Large Language Models designed to optimize the handling of long input sequences in retrieval-oriented tasks. This approach leverages the chunked prefill of the key-value cache to perform segment-wise inference, which enables efficient processing of extensive contexts along with the generation and classification of intermediate information ("margins") that guide the model towards specific tasks. This method increases computational overhead marginally while significantly enhancing the performance of off-the-shelf models without the need for fine-tuning. Specifically, we observe that WiM provides an average enhancement of 7.5% in accuracy for reasoning skills (HotpotQA, MultiHop-RAG) and more than a 30.0% increase in the F1-score for aggregation tasks (CWE). Additionally, we show how the proposed pattern fits into an interactive retrieval design that provides end-users with ongoing updates about the progress of context processing, and pinpoints the integration of relevant information into the final response. We release our implementation of WiM using Hugging Face Transformers library at https://github.com/writer/writing-in-the-margins.

View arXiv page View PDF Add to collection

Community

kiranr

Paper author Paper submitter Aug 28

link to the code : https://github.com/writer/writing-in-the-margins

AdinaY

Aug 28

•

edited Aug 28

Congrats on the paper🔥Amazing work!

samjulien

Aug 28

Amazing work, congratulations!

librarian-bot

Aug 29

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face checkout this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

ddiddi

Aug 29

Nicely done! Great paper!

jonah-ramponi

Aug 30

•

edited Aug 30

This is really cool!!

However I have got myself very confused. Can someone explain why this is true, I think I am missing something simple.

By splitting a prompt of length L into N chunks, each
of size K, where N = L/K, the overall memory complexity
of prefilling is reduced from O(L^2) to O(LK).

In the chunked case, am I right to assume our first chunk would have memory cost K * K (?) As each of the K tokens in our chunk attends to the others in said chunk.

For the second chunk, this is now a(2K * K) as our K tokens in this chunk now attend to the prior K tokens also. I think this would continue for ...

Total Cost = Cost Chunk 1 + ... + Cost Chunk N
= K^2 + 2K^2 + ... + NK^2
= K^2(1+...+N)
= K^2 (N)(N+1)/2
= K^2 (L/K)(L/K+1)/2 * by substitution of N = L/K
= K^2 (L^2/K^2 + L/K)/2
= (L^2 + LK)/2

And I am confused why this is O(LK), so I think there must be some fundamental flaw in my understanding

hkproj

Paper author Aug 30

•

edited Aug 30

Since chunked prefill is done "step by step" (each step is a forward pass through the model), the worst case memory complexity is allocated during the last step in which the last chunk is prefilled. For the last step, the memory complexity is the number of tokens in a single chunk (K) multiplied by the total number of tokens in the sequence (L).

Memory is always allocated and subsequently released after each forward pass (except for the KV-Cache and the model's parameters), that's why we don't "accumulate" the total cost over all chunks.

aipapersacademy

Aug 30

A video summary is now available here - https://youtu.be/JODc9ku5djA

melisa

Paper author Sep 2

I love it!

codelion

Oct 5

We actually recently did an independent implementation of this paper in our open-source optimizing llm proxy optillm - https://github.com/codelion/optillm/blob/main/optillm/plugins/memory_plugin.py

We were able to use it as a basis for the memory plugin in optillm that gives LLMs short term memory. It helps improve accuracy on long context retrieval and even enables LLMs to have unbounded context if needed.
We were able to match SOTA on a recent benchmark from Google Frames benchmark (https://huggingface.co/datasets/google/frames-benchmark) with only gpt-4o-mini v/s Gemini 1.5 Flash which has a context length that is 10x more.