EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty
Abstract
Auto-regressive decoding makes the inference of Large Language Models (LLMs) time-consuming. We propose a simple framework, EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), for lossless acceleration. Unlike traditional speculative sampling methods, EAGLE runs the drafting process auto-regressively at the more regular (second-top-layer) feature level and resolves the sampling uncertainty in next-feature prediction by additionally conditioning on the token sequence from one time step ahead. The acceleration provided by EAGLE is lossless: it involves no fine-tuning of the target LLM, and the generated text maintains the same distribution as that of vanilla auto-regressive decoding. As of the submission of this paper, EAGLE is the fastest known framework within the speculative sampling family. On MT-bench, EAGLE is 3x faster than vanilla decoding, 2x faster than Lookahead, and 1.6x faster than Medusa. Combined with gpt-fast, EAGLE attains on average 160 tokens/s with LLaMA2-Chat 13B on a single RTX 3090 GPU, compared to 24 tokens/s for Hugging Face's implementation.
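A minimal sketch of the drafting step described in the abstract, written in PyTorch. The module names, shapes, and the use of a generic transformer layer are illustrative stand-ins, not the authors' implementation (which reuses the target model's own embedding table, LM head, and a LLaMA-style decoder layer):

```python
import torch
import torch.nn as nn

class FeatureDrafter(nn.Module):
    """Illustrative sketch of feature-level drafting in the spirit of EAGLE.

    The draft model predicts the next second-top-layer feature f_{t+1} from the
    feature sequence f_1..f_t and the token sequence shifted one step ahead,
    then a (frozen, shared) LM head turns the predicted feature into draft-token logits.
    """

    def __init__(self, hidden_size, vocab_size, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)   # stands in for the target's embedding table
        self.fc = nn.Linear(2 * hidden_size, hidden_size)    # fuse feature + token embedding
        self.layer = nn.TransformerEncoderLayer(             # one self-attention layer, used with a causal mask
            hidden_size, num_heads, batch_first=True)
        self.lm_head = nn.Linear(hidden_size, vocab_size)    # stands in for the target's frozen LM head

    def forward(self, features, shifted_tokens):
        # features:       (batch, seq, hidden)  f_1..f_t from the target LLM
        # shifted_tokens: (batch, seq)          t_2..t_{t+1}, one step ahead of the features;
        #                 pairing each feature with the token actually sampled next
        #                 removes the sampling uncertainty in next-feature prediction.
        x = self.fc(torch.cat([features, self.embed(shifted_tokens)], dim=-1))
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.layer(x, src_mask=mask)
        next_feature = h[:, -1:]                              # predicted f_{t+1}
        draft_logits = self.lm_head(next_feature)
        return next_feature, draft_logits
```

In a full system the predicted feature is fed back as input for the next drafting step, and the drafted tokens are then verified in parallel by the target LLM so that the output distribution stays identical to vanilla decoding.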
Community
> Instead of employing text generated by the target LLM, we utilize a fixed dataset, substantially reducing the overhead.
Why not aim to be more faithful to the target LLM? And how would you feed the second-top-layer feature activations (Figure 7) to the draft model without utilizing the target LLM during training?
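For context, one plausible way to obtain those feature activations offline is a single forward pass of the frozen target LLM over the fixed dataset, caching the hidden states that feed the LM head. This sketch assumes the Hugging Face transformers API; the model name and the choice of `hidden_states[-1]` are illustrative, not taken from the paper's code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative: collect "second-top-layer" features by running the frozen
# target LLM once over a fixed text dataset (no generation needed).
model_name = "meta-llama/Llama-2-13b-chat-hf"  # any causal LM; name is illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

@torch.no_grad()
def collect_features(text):
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)
    # hidden_states[-1]: the last hidden states, i.e. the representation that
    # sits directly below the LM head; cache these alongside the token ids
    # as training targets/inputs for the draft model.
    return outputs.hidden_states[-1], inputs["input_ids"]
```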
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding (2024)
- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models (2024)
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (2024)
- Cascade Speculative Drafting for Even Faster LLM Inference (2023)
- Multi-Candidate Speculative Decoding (2024)
Can this be combined with the idea from the paper "Accelerating LLM Inference with Staged Speculative Decoding" where a smaller draft model is used to accelerate the process of obtaining predictions from the main draft model?
Or alternatively, combine it with the "Prompt Lookup Decoding" method
> Can this be combined with the idea from the paper "Accelerating LLM Inference with Staged Speculative Decoding"
+1, in particular the following two ideas from that paper:
- Use a shallow but wide tree to increase parallelism during verification. My reading of EAGLE is that this is already a big part of their speedup when parallel batch verification is enabled (e.g. see Figure 7)
- Having staged speculation within the smaller draft model. I think this is a great idea: the EAGLE draft model could be combined with a second, lighter-weight approach (e.g. the n-gram method described in your Prompt Lookup Decoding; a rough sketch follows below). I believe the thinking is that the deeper the speculation goes, the more tentative the drafts become, so fewer resources should be invested in computing them.
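A minimal sketch of the n-gram/prompt-lookup second stage mentioned above; the function name and parameters are illustrative, not taken from either paper:

```python
def prompt_lookup_draft(token_ids, max_ngram=3, num_draft=5):
    """Propose cheap draft tokens by matching the most recent n-gram earlier in the context.

    token_ids: list of int, the prompt plus tokens generated so far.
    Returns a (possibly empty) list of draft tokens to verify in parallel.
    """
    for n in range(max_ngram, 0, -1):                 # prefer the longest matching n-gram
        suffix = token_ids[-n:]
        # scan backwards for an earlier occurrence of the current suffix
        for start in range(len(token_ids) - n - 1, -1, -1):
            if token_ids[start:start + n] == suffix:
                continuation = token_ids[start + n:start + n + num_draft]
                if continuation:
                    return continuation               # copy what followed the match as the draft
    return []
```

Such a lookup is nearly free, which fits the reasoning above: its cheap guesses could extend the tail of the EAGLE draft where deeper speculation is most tentative, before everything is verified in one batch by the target LLM.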