stereoplegic's Collections: KV Cache
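The papers below compress, share, restructure, or manage the transformer KV cache used during autoregressive decoding: each layer caches the key/value projections of past tokens so attention is not recomputed at every step, at the cost of memory that grows with sequence length. As a point of reference, here is a minimal sketch of that baseline mechanism, assuming a single attention head; the names (SimpleKVCache, decode_step) are illustrative and not taken from any of the listed papers.

```python
# Minimal, illustrative single-head KV cache for autoregressive decoding.
# Hypothetical names (SimpleKVCache, decode_step); not from any listed paper.
import numpy as np

class SimpleKVCache:
    """Append-only cache of past key/value projections for one attention head."""
    def __init__(self, d_head: int):
        self.keys = np.empty((0, d_head))    # shape (t, d_head)
        self.values = np.empty((0, d_head))  # shape (t, d_head)

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        # k, v: (1, d_head) projections of the newest token
        self.keys = np.concatenate([self.keys, k], axis=0)
        self.values = np.concatenate([self.values, v], axis=0)

def decode_step(q: np.ndarray, k: np.ndarray, v: np.ndarray,
                cache: SimpleKVCache) -> np.ndarray:
    """One decode step: attend the new query over all cached keys/values."""
    cache.append(k, v)                        # cache grows linearly with sequence length
    d = q.shape[-1]
    scores = q @ cache.keys.T / np.sqrt(d)    # (1, t)
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ cache.values             # (1, d_head)

# Usage: per head, memory grows as O(seq_len * d_head); across a model it is
# O(layers * heads * seq_len * d_head), which is the cost the compression,
# eviction, sharing, and low-rank methods in this collection aim to reduce.
d_head = 64
cache = SimpleKVCache(d_head)
for _ in range(8):
    q, k, v = (np.random.randn(1, d_head) for _ in range(3))
    out = decode_step(q, k, v, cache)
```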
S^3: Increasing GPU Utilization during Generative Inference for Higher Throughput • arXiv:2306.06000
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference • arXiv:2405.12532
SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget • arXiv:2404.04793
MiniCache: KV Cache Compression in Depth Dimension for Large Language Models • arXiv:2405.14366
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM • arXiv:2403.05527
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation • arXiv:2405.05329
Effectively Compress KV Heads for LLM • arXiv:2406.07056
SinkLoRA: Enhanced Efficiency and Chat Capabilities for Long-Context Large Language Models • arXiv:2406.05678
Retaining Key Information under High Compression Ratios: Query-Guided Compressor for LLMs • arXiv:2406.02376
GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression • arXiv:2407.12077
RazorAttention: Efficient KV Cache Compression Through Retrieval Heads • arXiv:2407.15891
Beyond KV Caching: Shared Attention for Efficient LLMs • arXiv:2407.12866
Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention • arXiv:2408.08454
Efficient LLM Training and Serving with Heterogeneous Context Sharding among Attention Heads • arXiv:2407.17678
Post-Training Sparse Attention with Double Sparsity • arXiv:2408.07092
Palu: Compressing KV-Cache with Low-Rank Projection • arXiv:2407.21118
InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management • arXiv:2406.19707
Inference-Friendly Models With MixAttention • arXiv:2409.15012