igormolybog's Collections

Inference speed
FlashDecoding++: Faster Large Language Model Inference on GPUs (arXiv:2311.01282)
Co-training and Co-distillation for Quality Improvement and Compression of Language Models (arXiv:2311.02849)
Prompt Cache: Modular Attention Reuse for Low-Latency Inference (arXiv:2311.04934)
Exponentially Faster Language Modelling (arXiv:2311.10770)
SparQ Attention: Bandwidth-Efficient LLM Inference (arXiv:2312.04985)
Transformers are Multi-State RNNs (arXiv:2401.06104)
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (arXiv:2401.08671)
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (arXiv:2401.10774)
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models (arXiv:2401.12522)
SubGen: Token Generation in Sublinear Time and Memory (arXiv:2402.06082)
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models (arXiv:2402.07033)
Speculative Streaming: Fast LLM Inference without Auxiliary Models (arXiv:2402.11131)
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters (arXiv:2406.16758) — speculative decoding is sketched after this list
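Several entries above build on speculative decoding (Speculative Streaming, arXiv:2402.11131; the specialized-drafters paper, arXiv:2406.16758). As a reference point, here is a minimal sketch of the greedy variant of that idea. It is not code from any of the listed papers: the toy models, function names, and parameters are illustrative assumptions, and any real draft/target pair with the same next-token interface would slot in the same way.

```python
# Minimal sketch of greedy speculative decoding (illustrative only).
# A cheap draft model proposes k tokens per round; the target model keeps
# the agreeing prefix, so the output matches target-only greedy decoding.

from typing import Callable, List

Model = Callable[[List[int]], int]  # token sequence -> greedy next token
VOCAB = 32  # toy vocabulary size


def make_toy_model(err_every: int = 0) -> Model:
    """Deterministic toy next-token rule. err_every > 0 injects periodic
    disagreements, making the draft an imperfect copy of the target."""
    def model(seq: List[int]) -> int:
        nxt = (sum(seq) * 31 + len(seq)) % VOCAB
        if err_every and len(seq) % err_every == 0:
            nxt = (nxt + 1) % VOCAB
        return nxt
    return model


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       k: int, max_new: int) -> List[int]:
    """Generate max_new tokens with draft proposals verified by the target."""
    seq = list(prompt)
    new = 0
    while new < max_new:
        # 1. Draft proposes k tokens autoregressively (cheap calls).
        proposal, ctx = [], list(seq)
        for _ in range(k):
            nxt = draft(ctx)
            proposal.append(nxt)
            ctx.append(nxt)
        # 2. Target verifies. In a real system all k positions are scored in
        #    ONE batched forward pass (the speedup); written sequentially:
        for i in range(k):
            expect = target(seq)
            seq.append(expect)  # always the target-approved token
            new += 1
            if new >= max_new or expect != proposal[i]:
                break  # budget reached, or the rest of the draft is invalid
    return seq[len(prompt):]


if __name__ == "__main__":
    target = make_toy_model()
    draft = make_toy_model(err_every=7)  # agrees with the target most of the time

    fast = speculative_decode(target, draft, prompt=[1, 2, 3], k=4, max_new=20)

    # Reference: plain greedy decoding with the target only.
    seq = [1, 2, 3]
    for _ in range(20):
        seq.append(target(seq))
    assert fast == seq[3:]  # identical output to target-only greedy decoding
    print(fast)
```

The property worth noting: because the target's own token is always the one appended, the output is token-for-token identical to decoding with the target alone; the acceleration in real systems comes from verifying all k drafted positions in a single batched target pass instead of k sequential ones.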