- S^{3}: Increasing GPU Utilization during Generative Inference for Higher Throughput
  Paper • 2306.06000 • Published • 1
- Fast Distributed Inference Serving for Large Language Models
  Paper • 2305.05920 • Published • 1
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
  Paper • 2305.13144 • Published • 1
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
  Paper • 2303.06182 • Published • 1

Collections including paper arxiv:2312.04985

- Efficient Memory Management for Large Language Model Serving with PagedAttention
  Paper • 2309.06180 • Published • 25
- LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models
  Paper • 2308.16137 • Published • 39
- Scaling Transformer to 1M tokens and beyond with RMT
  Paper • 2304.11062 • Published • 2
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
  Paper • 2309.14509 • Published • 17

- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  Paper • 2312.00752 • Published • 138
- SparQ Attention: Bandwidth-Efficient LLM Inference
  Paper • 2312.04985 • Published • 38
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
  Paper • 2401.04658 • Published • 25
- E^2-LLM: Efficient and Extreme Length Extension of Large Language Models
  Paper • 2401.06951 • Published • 25

- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
  Paper • 2312.00752 • Published • 138
- SparQ Attention: Bandwidth-Efficient LLM Inference
  Paper • 2312.04985 • Published • 38
- Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
  Paper • 2402.00159 • Published • 61
- Neural Network Diffusion
  Paper • 2402.13144 • Published • 95

- FlashDecoding++: Faster Large Language Model Inference on GPUs
  Paper • 2311.01282 • Published • 35
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models
  Paper • 2311.02849 • Published • 3
- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
  Paper • 2311.04934 • Published • 28
- Exponentially Faster Language Modelling
  Paper • 2311.10770 • Published • 117

- MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
  Paper • 2311.07689 • Published • 7
- DiLoCo: Distributed Low-Communication Training of Language Models
  Paper • 2311.08105 • Published • 14
- SparQ Attention: Bandwidth-Efficient LLM Inference
  Paper • 2312.04985 • Published • 38
- Aligning Large Language Models with Counterfactual DPO
  Paper • 2401.09566 • Published • 2

- Prompt Cache: Modular Attention Reuse for Low-Latency Inference
  Paper • 2311.04934 • Published • 28
- Routing to the Expert: Efficient Reward-guided Ensemble of Large Language Models
  Paper • 2311.08692 • Published • 12
- Exponentially Faster Language Modelling
  Paper • 2311.10770 • Published • 117
- Memory Augmented Language Models through Mixture of Word Experts
  Paper • 2311.10768 • Published • 16

- Ziya2: Data-centric Learning is All LLMs Need
  Paper • 2311.03301 • Published • 16
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models
  Paper • 2311.02849 • Published • 3
- MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning
  Paper • 2311.02303 • Published • 4
- ADaPT: As-Needed Decomposition and Planning with Language Models
  Paper • 2311.05772 • Published • 10