memo
to be added to the awesome list of papers:
"Towards Understanding Grokking: An Effective Theory of Representation Learning","2022-05-20","https://arxiv.org/abs/2205.10343","Ziming Liu; Ouail Kitouni; Niklas Nolte; Eric J. Michaud; Max Tegmark; Mike Williams"
papers about the benefits of very large stepsizes in gradient descent / edge of stability (EoS); a minimal numerical sketch follows after these entries:
"Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency","2024-02-24","https://arxiv.org/abs/2402.15926","Jingfeng Wu; Peter L. Bartlett; Matus Telgarsky; Bin Yu"
"Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization","2024-06-12","https://arxiv.org/abs/2406.08654","Yuhang Cai; Jingfeng Wu; Song Mei; Michael Lindsey; Peter L. Bartlett"
"Implicit Bias of Gradient Descent for Logistic Regression at the Edge of Stability","2023-05-19","https://arxiv.org/abs/2305.11788","Jingfeng Wu; Vladimir Braverman; Jason D. Lee"
"On the Noisy Gradient Descent that Generalizes as SGD","2019-06-18","https://arxiv.org/abs/1906.07405","Jingfeng Wu; Wenqing Hu; Haoyi Xiong; Jun Huan; Vladimir Braverman; Zhanxing Zhu"
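A minimal sketch of the setting these papers study, assuming a toy linearly separable dataset and illustrative stepsize and iteration counts (not the authors' code or hyperparameters): full-batch gradient descent on the logistic loss with a large stepsize, where the loss can be non-monotone over the early iterations before it settles into a decreasing phase.

```python
# Minimal sketch (not the papers' code): full-batch GD on logistic loss with a
# deliberately large stepsize. Dataset, stepsize, and iteration count are
# illustrative; the early loss curve can be non-monotone (EoS-like) before the
# stable, decreasing phase.
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data with labels in {-1, +1}.
n, d = 32, 2
X = rng.normal(size=(n, d))
y = np.where(X @ np.array([1.0, -1.0]) > 0, 1.0, -1.0)

def sigmoid(z):
    # Numerically stable logistic function.
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def loss(w):
    # mean log(1 + exp(-y <x, w>)), computed stably.
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

def grad(w):
    margins = y * (X @ w)
    s = sigmoid(-margins)                 # per-sample derivative weight
    return -(X * (y * s)[:, None]).mean(axis=0)

w = np.zeros(d)
eta = 20.0                                # "large" stepsize (assumption; tune per problem)
for t in range(60):
    w = w - eta * grad(w)
    if t % 10 == 0:
        print(f"iter {t:3d}   logistic loss {loss(w):.4f}")
```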
to-read list:
"Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling","2024-09-23","https://arxiv.org/abs/2409.15156","Lechao Xiao"
"Learning From Biased Soft Labels","2023-02-16","https://arxiv.org/abs/2302.08155","Hua Yuan; Ning Xu; Yu Shi; Xin Geng; Yong Rui"
"Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks","2022-06-08","https://arxiv.org/abs/2206.03826","Jiachun Pan; Pan Zhou; Shuicheng Yan"
"How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression?","2023-10-12","https://arxiv.org/abs/2310.08391","Jingfeng Wu; Difan Zou; Zixiang Chen; Vladimir Braverman; Quanquan Gu; Peter L. Bartlett"
miscellaneous:
https://github.com/xjdr-alt/entropix: Entropy Based Sampling and Parallel CoT Decoding (a generic sketch of entropy-based sampling follows at the end of this list)
works related to "Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning"
Scaling Laws in Linear Regression: Compute, Parameters, and Data
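A generic sketch of the entropy-based sampling idea named in the entropix repo description, not the repo's actual implementation; the threshold and temperature below are illustrative assumptions. The idea: decode greedily when the next-token distribution has low entropy (the model is confident), and sample stochastically when entropy is high.

```python
# Generic sketch of entropy-based sampling (illustrative; not the entropix code).
# Greedy decoding when the next-token distribution has low entropy, temperature
# sampling when entropy is high.
import numpy as np

def entropy_based_sample(logits, rng, threshold=2.0, temperature=1.0):
    """logits: 1-D array of next-token logits.
    threshold / temperature are hypothetical hyperparameters for this sketch."""
    shifted = logits - logits.max()
    probs = np.exp(shifted)
    probs /= probs.sum()
    entropy = -(probs * np.log(probs + 1e-12)).sum()   # entropy in nats
    if entropy < threshold:
        return int(probs.argmax())                      # confident: greedy token
    scaled = np.exp(shifted / temperature)              # uncertain: sample
    scaled /= scaled.sum()
    return int(rng.choice(len(logits), p=scaled))

rng = np.random.default_rng(0)
print(entropy_based_sample(np.array([3.0, 0.5, 0.2, 0.1]), rng, threshold=1.0))  # low entropy: greedy
print(entropy_based_sample(np.array([1.0, 0.9, 0.8, 0.7]), rng, threshold=1.0))  # high entropy: sampled
```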