TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters Paper • 2410.23168 • Published Oct 30, 2024
nGPT: Normalized Transformer with Representation Learning on the Hypersphere Paper • 2410.01131 • Published Oct 1, 2024
Stuffed Mamba: State Collapse and State Capacity of RNN-Based Long-Context Modeling Paper • 2410.07145 • Published Oct 9, 2024
Round and Round We Go! What makes Rotary Positional Encodings useful? Paper • 2410.06205 • Published Oct 8, 2024
TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices Paper • 2410.00531 • Published Oct 1, 2024
Aria: An Open Multimodal Native Mixture-of-Experts Model Paper • 2410.05993 • Published Oct 8, 2024
The Mamba in the Llama: Distilling and Accelerating Hybrid Models Paper • 2408.15237 • Published Aug 27, 2024
KTO: Model Alignment as Prospect Theoretic Optimization Paper • 2402.01306 • Published Feb 2, 2024
Planning In Natural Language Improves LLM Search For Code Generation Paper • 2409.03733 • Published Sep 5, 2024
FocusLLM: Scaling LLM's Context by Parallel Decoding Paper • 2408.11745 • Published Aug 21, 2024
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale Paper • 2408.12570 • Published Aug 22, 2024
LLM Pruning and Distillation in Practice: The Minitron Approach Paper • 2408.11796 • Published Aug 21, 2024