EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Abstract
Auto-regressive decoding makes the inference of Large Language Models (LLMs) time-consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), for lossless acceleration. Unlike traditional speculative sampling methods, EAGLE runs the drafting process auto-regressively at the more regular (second-top-layer) feature level and resolves the sampling uncertainty in next-feature prediction by additionally conditioning on the token sequence from one time step ahead. The acceleration provided by EAGLE is lossless: it involves no fine-tuning of the target LLM, and the generated text maintains the same distribution as that of vanilla auto-regressive decoding. As of the submission of this paper, EAGLE is the fastest known framework within the speculative sampling family. On MT-bench, EAGLE is 3x faster than vanilla decoding, 2x faster than Lookahead, and 1.6x faster than Medusa. Combined with gpt-fast, EAGLE attains on average 160 tokens/s with LLaMA2-Chat 13B on a single RTX 3090 GPU, compared to 24 tokens/s for Hugging Face's implementation.
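A minimal sketch of the drafting step described in the abstract, written in PyTorch. The module names, shapes, and the use of a generic transformer layer are illustrative stand-ins, not the authors' implementation (which reuses the target model's own embedding table, LM head, and a LLaMA-style decoder layer):

```python
import torch
import torch.nn as nn

class FeatureDrafter(nn.Module):
    """Illustrative sketch of feature-level drafting in the spirit of EAGLE.

    The draft model predicts the next second-top-layer feature f_{t+1} from the
    feature sequence f_1..f_t and the token sequence shifted one step ahead,
    then a (frozen, shared) LM head turns the predicted feature into draft-token logits.
    """

    def __init__(self, hidden_size, vocab_size, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)   # stands in for the target's embedding table
        self.fc = nn.Linear(2 * hidden_size, hidden_size)    # fuse feature + token embedding
        self.layer = nn.TransformerEncoderLayer(             # one self-attention layer, used with a causal mask
            hidden_size, num_heads, batch_first=True)
        self.lm_head = nn.Linear(hidden_size, vocab_size)    # stands in for the target's frozen LM head

    def forward(self, features, shifted_tokens):
        # features:       (batch, seq, hidden)  f_1..f_t from the target LLM
        # shifted_tokens: (batch, seq)          t_2..t_{t+1}, one step ahead of the features;
        #                 pairing each feature with the token actually sampled next
        #                 removes the sampling uncertainty in next-feature prediction.
        x = self.fc(torch.cat([features, self.embed(shifted_tokens)], dim=-1))
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.layer(x, src_mask=mask)
        next_feature = h[:, -1:]                              # predicted f_{t+1}
        draft_logits = self.lm_head(next_feature)
        return next_feature, draft_logits
```

In a full system the predicted feature is fed back as input for the next drafting step, and the drafted tokens are then verified in parallel by the target LLM so that the output distribution stays identical to vanilla decoding.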
Community
> Instead of employing text generated by the target LLM, we utilize a fixed dataset, substantially reducing the overhead.
Why not aim to be more faithful to the target LLM? And how would you feed the second-top-layer feature activations (Figure 7) to the draft model without utilizing the target LLM during training?
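For context, one plausible way to obtain those feature activations offline is a single forward pass of the frozen target LLM over the fixed dataset, caching the hidden states that feed the LM head. This sketch assumes the Hugging Face transformers API; the model name and the choice of `hidden_states[-1]` are illustrative, not taken from the paper's code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative: collect "second-top-layer" features by running the frozen
# target LLM once over a fixed text dataset (no generation needed).
model_name = "meta-llama/Llama-2-13b-chat-hf"  # any causal LM; name is illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

@torch.no_grad()
def collect_features(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    # hidden_states[-1]: the last hidden states, i.e. the representation that
    # sits directly below the LM head; cache these alongside the token ids
    # as training targets/inputs for the draft model.
    return outputs.hidden_states[-1], inputs["input_ids"]
```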
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding (2024)
- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models (2024)
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (2024)
- Cascade Speculative Drafting for Even Faster LLM Inference (2023)
- Multi-Candidate Speculative Decoding (2024)
Can this be combined with the idea from the paper "Accelerating LLM Inference with Staged Speculative Decoding" where a smaller draft model is used to accelerate the process of obtaining predictions from the main draft model?
Or alternatively, combine it with the "Prompt Lookup Decoding" method
> Can this be combined with the idea from the paper "Accelerating LLM Inference with Staged Speculative Decoding"
+1, in particular the following two ideas from that paper:
- Use a shallow but wide tree to increase parallelism during verification. My reading of EAGLE is that this is already a big part of their speedup when parallel batch verification is enabled (e.g. see Figure 7)
- Having staged speculation within the smaller draft model. I think this is a great idea: the EAGLE draft model could be combined with a second, lighter-weight approach (e.g. the n-gram method described in your Prompt Lookup Decoding; a rough sketch follows below). I believe the thinking is that the deeper the speculation goes, the more tentative the drafts become, so fewer resources should be invested in computing them.
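A minimal sketch of the n-gram/prompt-lookup second stage mentioned above; the function name and parameters are illustrative, not taken from either paper:

```python
def prompt_lookup_draft(token_ids, max_ngram=3, num_draft=5):
    """Propose cheap draft tokens by matching the most recent n-gram earlier in the context.

    token_ids: list of int, the prompt plus tokens generated so far.
    Returns a (possibly empty) list of draft tokens to verify in parallel.
    """
    for n in range(max_ngram, 0, -1):                 # prefer the longest matching n-gram
        suffix = token_ids[-n:]
        # scan backwards for an earlier occurrence of the current suffix
        for start in range(len(token_ids) - n - 1, -1, -1):
            if token_ids[start:start + n] == suffix:
                continuation = token_ids[start + n:start + n + num_draft]
                if continuation:
                    return continuation               # copy what followed the match as the draft
    return []
```

Such a lookup is nearly free, which fits the reasoning above: its cheap guesses could extend the tail of the EAGLE draft where deeper speculation is most tentative, before everything is verified in one batch by the target LLM.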