- Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models
  Paper • 2311.00871 • Published • 2
- Can large language models explore in-context?
  Paper • 2403.15371 • Published • 32
- Data Distributional Properties Drive Emergent In-Context Learning in Transformers
  Paper • 2205.05055 • Published • 2
- Long-context LLMs Struggle with Long In-context Learning
  Paper • 2404.02060 • Published • 36
Collections including paper arxiv:2404.07129
- Wide Residual Networks
  Paper • 1605.07146 • Published • 2
- Characterizing signal propagation to close the performance gap in unnormalized ResNets
  Paper • 2101.08692 • Published • 2
- Pareto-Optimal Quantized ResNet Is Mostly 4-bit
  Paper • 2105.03536 • Published • 2
- When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
  Paper • 2106.01548 • Published • 2
- Resonance RoPE: Improving Context Length Generalization of Large Language Models
  Paper • 2403.00071 • Published • 22
- Scaling Laws of RoPE-based Extrapolation
  Paper • 2310.05209 • Published • 7
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
  Paper • 2404.12387 • Published • 38
- OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
  Paper • 2404.14619 • Published • 126
- Linear Transformers with Learnable Kernel Functions are Better In-Context Models
  Paper • 2402.10644 • Published • 79
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  Paper • 2305.13245 • Published • 5
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 19
- Sequence Parallelism: Long Sequence Training from System Perspective
  Paper • 2105.13120 • Published • 5