PKU Class 2023 Fall: Optimization for Deep Learning
Instructor: Kun Yuan (kunyuan@pku.edu.cn)
Teaching assistants:
- Yutong He (yutonghe@pku.edu.cn)
- Jinghua Huang (jinghua@stu.pku.edu.cn)
- Pengfei Wu (pengfeiwu1999@stu.pku.edu.cn)
- Hao Yuan (pkuyuanhao@pku.edu.cn)
Office hour: 2pm - 3pm Thursday, Room 220, Jingyuan Courtyard 6 (静园六院)
References
Martin Jaggi and Nicolas Flammarion, Optimization for Machine Learning, EPFL Class CS-439
Chris De Sa, Advanced Machine Learning Systems, Cornell CS6787
Final exam
Final exam [Exam_GH] [Exam_BD]
Code for Problem 4 [Code_GH] [Code_BD]
Please turn in your exam paper and code by 11:59 PM on January 14, 2024.
Projects
[Projects_GH] [Projects_BD]
Presentation materials are due by 11:59 AM on December 26, 2023.
Project codes and reports are due by 11:59 PM on January 14, 2024.
Materials
Remark: All materials can be retrieved from two sources: GitHub and Baidu Wangpan.
Lecture 1: Introduction
Lecture 2: Gradient descent (a minimal code sketch follows this lecture's reading list)
- Notes: Gradient descent [Notes_GH] [Notes_BD]
- Slides: Gradient descent [Slides_GH] [Slides_BD]
- Homework 2: [Homework_GH] [Homework_BD]
- Reading:
- Stephen Boyd and Lieven Vandenberghe, Convex Optimization, Cambridge University Press, 2004. [Ch. 2 and 3]
- Yurii Nesterov, Introductory lectures on convex optimization: A basic course, Springer Science & Business Media, 2003 [Sec. 2.1.1, 2.1.3, and 2.1.5]
- Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, Dive into Deep Learning [Sec. 5.3]
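A minimal Python sketch of the gradient descent update covered in this lecture, run on a least-squares toy problem. The objective, the 1/L step size, and the problem dimensions are illustrative choices, not taken from the course materials:

```python
# Gradient descent on f(x) = 0.5 * ||A x - b||^2 with step size 1/L,
# where L = lambda_max(A^T A) is the gradient's Lipschitz constant.
import numpy as np

def gradient_descent(A, b, num_iters=500):
    x = np.zeros(A.shape[1])
    L = np.linalg.eigvalsh(A.T @ A).max()   # smoothness constant of the quadratic
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)            # gradient of 0.5 * ||A x - b||^2
        x -= grad / L                       # constant step size 1/L
    return x

# Usage: the iterate should be close to the closed-form least-squares solution.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(gradient_descent(A, b) - x_star))
```

For a general L-smooth objective the same 1/L step size guarantees descent; in practice L is unknown and is replaced by a tuned constant step or a line search.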
Lecture 3: Accelerated gradient descent
Lecture 4: Projected gradient descent and Proximal gradient descent
Lecture 5: Zeroth-order optimization
Lecture 6: Stochastic gradient descent
Lecture 7: Stochastic gradient descent: sampling strategy and stability (a minimal sampling sketch follows the reading list)
- Notes: Sampling and Stability [Notes_GH][Notes_BD]
- Part I: SGD with finite sample size; importance sampling; random reshuffling [Slides_GH] [Slides_BD]
- Part II: GD stability; SGD stability; Sharpness-aware minimization [Slides_GH] [Slides_BD]
- Homework 7: [Homework_GH] [Homework_BD]
- Reading:
- Kun Yuan et al., Stochastic gradient descent with finite samples sizes, IEEE Workshop on Machine Learning for Signal Processing, 2016
- Peilin Zhao and Tong Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization, ICML, 2015.
- Bicheng Ying et al., Stochastic Learning under Random Reshuffling with Constant Step-sizes, IEEE Transactions on Signal Processing, 2018
- Lei Wu et al., How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective, NeurIPS 2018
- Lei Wu et al., The alignment property of sgd noise and how it helps select flat minima: A stability analysis, NeurIPS 2022
- Pierre Foret et al., Sharpness-Aware Minimization for Efficiently Improving Generalization, ICLR, 2021
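As noted under the Lecture 7 heading, here is a minimal Python sketch contrasting i.i.d. sampling with replacement and random reshuffling for SGD on a least-squares toy problem. The objective, step size, and function names are illustrative assumptions, not course code:

```python
# SGD on f(x) = (1/2n) * ||A x - b||^2 using either i.i.d. sampling with
# replacement or random reshuffling (a fresh permutation every epoch).
import numpy as np

def sgd(A, b, num_epochs=50, lr=0.01, reshuffle=True, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(num_epochs):
        # Reshuffling visits every sample exactly once per epoch;
        # the alternative draws n indices uniformly with replacement.
        idx = rng.permutation(n) if reshuffle else rng.integers(0, n, size=n)
        for i in idx:
            grad_i = A[i] * (A[i] @ x - b[i])   # per-sample gradient
            x -= lr * grad_i
    return x

# Usage: compare the final distance to the least-squares solution.
rng = np.random.default_rng(1)
A, b = rng.standard_normal((200, 5)), rng.standard_normal(200)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
for flag in (False, True):
    err = np.linalg.norm(sgd(A, b, reshuffle=flag) - x_star)
    print(f"reshuffle={flag}: distance to x* = {err:.4f}")
```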
Lecture 8: Momentum and Adaptive SGD (a minimal Adam sketch follows the reading list)
- Notes I: Momentum SGD [Notes_GH][Notes_BD]
- Notes II: Adaptive SGD [Notes_GH][Notes_BD]
- Part I: Momentum SGD; SGD with Nesterov momentum; lower bound in stochastic optimization [Slides_GH] [Slides_BD]
- Part II: Preconditioned SGD; AdaGrad; RMSProp; Adam; AdamW [Slides_GH] [Slides_BD]
- Homework: No homework
- Reading:
- Ilya Sutskever et al., On the importance of initialization and momentum in deep learning, ICML, 2013
- Kun Yuan et al., On the influence of momentum acceleration on online learning, JMLR, 2016
- John Duchi et al., Adaptive subgradient methods for online learning and stochastic optimization, JMLR, 2011
- Diederik P. Kingma et al., Adam: A Method for Stochastic Optimization, 2014
- Zhishuai Guo et al., A novel convergence analysis for algorithms of the Adam family, 2021
- Ilya Loshchilov et al., Decoupled Weight Decay Regularization, ICLR, 2019
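A minimal Python sketch of the Adam update discussed in Part II, following the moment estimates and bias correction of Kingma et al. above. The badly scaled quadratic test function and the hyperparameters are illustrative defaults, not course code:

```python
# Adam: exponential moving averages of the gradient and its square,
# bias-corrected, then a coordinate-wise preconditioned step.
import numpy as np

def adam(grad_fn, x0, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    x = x0.astype(float)
    m = np.zeros_like(x)    # first-moment (momentum) estimate
    v = np.zeros_like(x)    # second-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)           # bias correction
        v_hat = v / (1 - beta2**t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Usage: a badly scaled quadratic, where Adam's coordinate-wise scaling helps.
scales = np.array([1.0, 100.0])
grad = lambda x: scales * x                  # gradient of 0.5 * sum(scales * x^2)
print(adam(grad, np.array([5.0, 5.0])))      # should approach the minimizer at 0
```

Applying a decoupled weight-decay term directly in the update, rather than adding L2 regularization to the gradient, gives the AdamW variant of Loshchilov et al. listed above.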
Lecture 9: Variance reduction
Lecture 10-1: Adversarial learning
Lecture 10-2: Gradient clipping
Lecture 11: Mixed-Precision Training
Lecture 12: Meta Learning (a minimal meta-learning sketch follows the reading list)
- Part I: Introduction to Meta Learning (we will use the excellent slides from Prof. Hung-Yi Lee)
- Part II: Learning to Initialize; MAML; Reptile [Slides_GH][Slides_BD]
- Part III: Learning to Optimize [Slides_GH][Slides_BD]
- Reading:
- Chelsea Finn et al., Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017
- Antreas Antoniou et al., How to train your MAML, ICLR 2019.
- Alex Nichol et al., On First-Order Meta-Learning Algorithms, 2018.
- Marcin Andrychowicz et al., Learning to learn by gradient descent by gradient descent, NIPS 2016.
- Karol Gregor et al., Learning Fast Approximations of Sparse Coding, ICML 2010.
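A minimal Python sketch of the learn-to-initialize idea, in the spirit of the Reptile algorithm of Nichol et al. above. The toy task family (quadratics with task-specific centers), the learning rates, and the function names are illustrative assumptions:

```python
# Reptile-style meta-learning: adapt to a sampled task with a few inner
# gradient steps, then nudge the shared initialization toward the adapted weights.
import numpy as np

def inner_adapt(x, c, inner_lr=0.1, inner_steps=5):
    # A few gradient steps on one task f_c(x) = 0.5 * ||x - c||^2.
    for _ in range(inner_steps):
        x = x - inner_lr * (x - c)           # gradient of f_c is x - c
    return x

def reptile(task_centers, meta_lr=0.1, meta_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    init = np.zeros(2)
    for _ in range(meta_iters):
        c = task_centers[rng.integers(len(task_centers))]   # sample a task
        adapted = inner_adapt(init, c)
        init = init + meta_lr * (adapted - init)            # outer (meta) update
    return init

# Usage: the learned initialization ends up near the center of the task optima,
# so a few inner steps adapt it quickly to any single task.
tasks = [np.array([1.0, 2.0]), np.array([3.0, -1.0]), np.array([-2.0, 0.5])]
print(reptile(tasks))
```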
Lecture 13: Decentralized Learning (a minimal DGD sketch follows the reading list)
- Part I: Introduction to distributed learning [Slides_GH][Slides_BD]
- Part II: Decentralized communication; Average consensus; Dynamic average consensus [Slides_GH][Slides_BD]
- Part III: DGD; Diffusion; EXTRA; Exact-Diffusion; Gradient-tracking [Slides_GH][Slides_BD]
- Part IV: Transient stage; Stochastic decentralized algorithms [Slides_GH][Slides_BD]
- Part V: Exponential graph; One-peer exponential graph; EquiTopo graph [Slides_GH][Slides_BD]
- Part VI: BlueFog library [Slides_GH][Slides_BD]
- Reading:
- Kun Yuan et al., On the convergence of decentralized gradient descent, SIAM Journal on Optimization, 2016.
- Wei Shi et al., EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization, SIAM Journal on Optimization, 2015.
- Angelia Nedich et al., Achieving Geometric Convergence for Distributed Optimization over Time-Varying Graphs, SIAM Journal on Optimization, 2017.
- Anastasia Koloskova et al., A Unified Theory of Decentralized SGD with Changing Topology and Local Updates, ICML 2020.
- Kun Yuan et al., Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD, JMLR 2023.
- Bicheng Ying et al., Exponential Graph is Provably Efficient for Decentralized Deep Training, NeurIPS 2021.
- Bicheng Ying et al., BlueFog: Make Decentralized Algorithms Practical for Optimization and Deep Learning, 2021.
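A minimal Python sketch of decentralized gradient descent (DGD) from Part III: each node mixes its iterate with its neighbors through a doubly stochastic matrix and then takes a local gradient step. The ring topology, mixing weights, local quadratic objectives, and step size are illustrative assumptions, not course code:

```python
# DGD on n nodes: node i holds f_i(x) = 0.5 * ||x - c_i||^2 and only
# communicates with its two neighbors on a ring.
import numpy as np

def ring_mixing_matrix(n):
    # Doubly stochastic weights: 1/3 to self and to each ring neighbor.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def dgd(centers, alpha=0.05, num_iters=500):
    n, d = centers.shape
    W = ring_mixing_matrix(n)
    X = np.zeros((n, d))                 # row i is node i's local iterate
    for _ in range(num_iters):
        grads = X - centers              # local gradients of 0.5 * ||x - c_i||^2
        X = W @ X - alpha * grads        # mix with neighbors, then descend
    return X

# Usage: with a constant step size the local iterates cluster in an
# O(alpha) neighborhood of the global minimizer, the average of the c_i.
centers = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [2.0, -1.0]])
print(dgd(centers))
print(centers.mean(axis=0))
```

Bias-corrected methods such as EXTRA, Exact-Diffusion, and gradient tracking (listed in the readings above) remove this constant-step-size bias.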
Lecture 14: Federated Learning