-
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Paper • 1701.06538 • Published • 5 -
Sparse Networks from Scratch: Faster Training without Losing Performance
Paper • 1907.04840 • Published • 3 -
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Paper • 1910.02054 • Published • 4 -
A Mixture of h-1 Heads is Better than h Heads
Paper • 2005.06537 • Published • 2
Collections
Discover the best community collections!
Collections including paper arxiv:2006.16668
-
Non-asymptotic oracle inequalities for the Lasso in high-dimensional mixture of experts
Paper • 2009.10622 • Published • 1 -
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 49 -
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Paper • 2401.04081 • Published • 70 -
MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
Paper • 2401.14361 • Published • 2
-
Mixtral of Experts
Paper • 2401.04088 • Published • 158 -
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 49 -
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Paper • 2401.04081 • Published • 70 -
EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models
Paper • 2308.14352 • Published
-
FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models
Paper • 2402.10986 • Published • 77 -
bigcode/starcoder2-15b
Text Generation • Updated • 27.8k • • 576 -
Zephyr: Direct Distillation of LM Alignment
Paper • 2310.16944 • Published • 123 -
mixedbread-ai/mxbai-rerank-large-v1
Text Classification • Updated • 29.7k • 116
-
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 49 -
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • 2401.06066 • Published • 44 -
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Paper • 2312.07987 • Published • 41 -
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Paper • 2101.03961 • Published • 14
-
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Paper • 1701.06538 • Published • 5 -
Sparse Networks from Scratch: Faster Training without Losing Performance
Paper • 1907.04840 • Published • 3 -
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Paper • 1910.02054 • Published • 4 -
A Mixture of h-1 Heads is Better than h Heads
Paper • 2005.06537 • Published • 2
-
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper • 2401.06066 • Published • 44 -
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Paper • 2006.16668 • Published • 3 -
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
Paper • 2402.01739 • Published • 26 -
BlackMamba: Mixture of Experts for State-Space Models
Paper • 2402.01771 • Published • 23