Lecture 2: Gradient descent
- Notes: Gradient descent
- Slides: Gradient descent
- Reading:
- Stephen Boyd and Lieven Vandenberghe, Convex Optimization, Cambridge University Press, 2004. [Ch. 2 and 3]
- Yurii Nesterov, Introductory lectures on convex optimization: A basic course, Springer Science & Business Media, 2003 [Sec. 2.1.1, 2.1.3, and 2.1.5]
- Aston Zhang, Zack C. Lipton, Mu Li, and Alex J. Smola, Dive into Deep Learning, [Sec. 5.3]
- Notes: Sampling and Stability
- Part I: SGD with finite sample size; importance sampling; random reshuffling
- Part II: GD stability; SGD stability; Sharpness-aware minimization
- Reading:
- Kun Yuan et al., Stochastic gradient descent with finite samples sizes, IEEE Workshop on Machine Learning for Signal Processing, 2016
- Peilin Zhao and Tong Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization, ICML, 2015.
- Bicheng Ying et al., Stochastic Learning under Random Reshuffling with Constant Step-sizes, IEEE Transactions on Signal Processing, 2018
- Lei Wu et al., How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective, NeurIPS 2018
- Lei Wu et al., The alignment property of SGD noise and how it helps select flat minima: A stability analysis, NeurIPS 2022
- Pierre Foret et al., Sharpness-Aware Minimization for Efficiently Improving Generalization, ICLR, 2021
- Notes I: Momentum SGD
- Notes II: Adaptive SGD
- Part I: Momentum SGD; SGD with Nesterov momentum; lower bound in stochastic optimization
- Part II: Preconditioned SGD; AdaGrad; RMSProp; Adam; AdamW
- Reading:
- Ilya Sutskever et al., On the importance of initialization and momentum in deep learning, ICML, 2013
- Kun Yuan et al., On the influence of momentum acceleration on online learning, JMLR, 2016
- John Duchi et al., Adaptive subgradient methods for online learning and stochastic optimization, JMLR, 2011
- Diederik P. Kingma et al., Adam: A Method for Stochastic Optimization, 2014
- Zhishuai Guo et al., A novel convergence analysis for algorithms of the adam family, 2021
- Ilya Loshchilov et al., Decoupled Weight Decay Regularization, ICLR, 2019
- Part I: Introduction to Meta Learning (we will use the excellent slides by Prof. Hung-Yi Lee)
- Part II: Learning to Initialize; MAML; Reptile
- Part III: Learning to Optimize
- Reading:
- Chelsea Finn et al., Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017
- Antreas Antoniou et al., How to train your MAML, ICLR 2019.
- Alex Nichol et al., On First-Order Meta-Learning Algorithms, 2018.
- Marcin Andrychowicz et al., Learning to learn by gradient descent by gradient descent, NIPS 2016.
- Karol Gregor et al., Learning Fast Approximations of Sparse Coding, ICML 2010.
- Part I: Introduction to distributed learning
- Part II: Decentralized communication; Average consensus; Dynamic average consensus
- Part III: DGD; Diffusion; EXTRA; Exact-Diffusion; Gradient-tracking
- Part IV: Transient stage; Stochastic decentralized algorithms
- Part V: Exponential graph; One-peer exponential graph; EquiTopo graph
- Part VI: BlueFog library
- Reading:
- Kun Yuan et al., On the convergence of decentralized gradient descent, SIAM Journal on Optimization, 2016.
- Wei Shi et al., EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization, SIAM Journal on Optimization, 2015.
- Angelia Nedich et al., Achieving Geometric Convergence for Distributed Optimization over Time-Varying Graphs, SIAM Journal on Optimization, 2017.
- Anastasia Koloskova et al., A Unified Theory of Decentralized SGD with Changing Topology and Local Updates, ICML 2020.
- Kun Yuan et al., Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD, JMLR 2023.
- Bicheng Ying et al., Exponential Graph is Provably Efficient for Decentralized Deep Training, NeurIPS 2021.
- Bicheng Ying et al., BlueFog: Make Decentralized Algorithms Practical for Optimization and Deep Learning, 2021.
