henern's Collections: Inference
Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference (arXiv: 2403.09636)
TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding (arXiv: 2404.11912)
Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache (arXiv: 2401.02669)
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding (arXiv: 2404.16710)
Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting (arXiv: 2404.18911)
Octopus v4: Graph of Language Models (arXiv: 2404.19296)
Better & Faster Large Language Models via Multi-token Prediction (arXiv: 2404.19737)
Similarity is Not All You Need: Endowing Retrieval Augmented Generation with Multi Layered Thoughts (arXiv: 2405.19893)
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (arXiv: 2405.12981)
LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs (arXiv: 2406.15319)
Writing in the Margins: Better Inference Pattern for Long Context Retrieval (arXiv: 2408.14906)