- Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models
  Paper • 2311.00871 • Published • 2
- Can large language models explore in-context?
  Paper • 2403.15371 • Published • 32
- Data Distributional Properties Drive Emergent In-Context Learning in Transformers
  Paper • 2205.05055 • Published • 2
- Long-context LLMs Struggle with Long In-context Learning
  Paper • 2404.02060 • Published • 36
Collections including paper arxiv:2404.07129
- Wide Residual Networks
  Paper • 1605.07146 • Published • 2
- Characterizing signal propagation to close the performance gap in unnormalized ResNets
  Paper • 2101.08692 • Published • 2
- Pareto-Optimal Quantized ResNet Is Mostly 4-bit
  Paper • 2105.03536 • Published • 2
- When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations
  Paper • 2106.01548 • Published • 2
- Resonance RoPE: Improving Context Length Generalization of Large Language Models
  Paper • 2403.00071 • Published • 22
- Scaling Laws of RoPE-based Extrapolation
  Paper • 2310.05209 • Published • 7
- Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
  Paper • 2404.12387 • Published • 38
- OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework
  Paper • 2404.14619 • Published • 126
- Linear Transformers with Learnable Kernel Functions are Better In-Context Models
  Paper • 2402.10644 • Published • 79
- GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
  Paper • 2305.13245 • Published • 5
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
  Paper • 2402.15220 • Published • 19
- Sequence Parallelism: Long Sequence Training from System Perspective
  Paper • 2105.13120 • Published • 5