u-brixton's Collections
Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning (arXiv:2402.17457)
Curvature-Informed SGD via General Purpose Lie-Group Preconditioners (arXiv:2402.04553)
TextGrad: Automatic "Differentiation" via Text (arXiv:2406.07496)
Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling (arXiv:2405.14578)
Fast Benchmarking of Accuracy vs. Training Time with Cyclic Learning Rates (arXiv:2206.00832)
Large Language Models as Markov Chains (arXiv:2410.02724)
Old Optimizer, New Norm: An Anthology (arXiv:2409.20325)
Scaling Law with Learning Rate Annealing (arXiv:2408.11029)
What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective (arXiv:2410.23743)
ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models (arXiv:2410.09637)
In-context learning and Occam's razor (arXiv:2410.14086)
nGPT: Normalized Transformer with Representation Learning on the Hypersphere (arXiv:2410.01131)
Cautious Optimizers: Improving Training with One Line of Code (arXiv:2411.16085)
MARS: Unleashing the Power of Variance Reduction for Training Large Models (arXiv:2411.10438)
Understanding Gradient Descent through the Training Jacobian (arXiv:2412.07003)