PKU Class 2023 Fall: Optimization for Deep Learning
Instructor: Kun Yuan (kunyuan@pku.edu.cn)
Teaching assistants:
- Yutong He (yutonghe@pku.edu.cn)
- Jinghua Huang (jinghua@stu.pku.edu.cn)
- Pengfei Wu (pengfeiwu1999@stu.pku.edu.cn)
- Hao Yuan (pkuyuanhao@pku.edu.cn)
Office hour: 2pm - 3pm Thursday, Room 220, Jingyuan Courtyard 6 (静园六院)
References
Martin Jaggi and Nicolas Flammarion, Optimization for Machine Learning, EPFL Class CS-439
Chris De Sa, Advanced Machine Learning Systems, Cornell CS6787
Final exam
Final exam [Exam_GH] [Exam_BD]
Code for Problem 4 [Code_GH] [Code_BD]
Please turn in your exam paper and code by 11:59 PM on January 14, 2024.
Projects
[Projects_GH] [Projects_BD]
Presentation materials are due by 11:59 AM on December 26, 2023.
Project codes and reports are due by 11:59 PM on January 14, 2024.
Materials
Remark: All materials can be retrieved from two sources: GitHub and Baidu Wangpan.
Lecture 1: Introduction
Lecture 2: Gradient descent (a minimal code sketch follows this lecture's reading list)
- Notes: Gradient descent [Notes_GH] [Notes_BD]
- Slides: Gradient descent [Slides_GH] [Slides_BD]
- Homework 2: [Homework_GH] [Homework_BD]
- Reading:
- Stephen Boyd and Lieven Vandenberghe, Convex Optimization, Cambridge University Press, 2004. [Ch. 2 and 3]
- Yurii Nesterov, Introductory lectures on convex optimization: A basic course, Springer Science & Business Media, 2003 [Sec. 2.1.1, 2.1.3, and 2.1.5]
- Aston Zhang, Zachary C. Lipton, Mu Li, and Alexander J. Smola, Dive into Deep Learning [Sec. 5.3]
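A minimal Python sketch of the gradient descent update covered in this lecture, run on a least-squares toy problem. The objective, the 1/L step size, and the problem dimensions are illustrative choices, not taken from the course materials:

```python
# Gradient descent on f(x) = 0.5 * ||A x - b||^2 with step size 1/L,
# where L = lambda_max(A^T A) is the gradient's Lipschitz constant.
import numpy as np

def gradient_descent(A, b, num_iters=500):
    x = np.zeros(A.shape[1])
    L = np.linalg.eigvalsh(A.T @ A).max()   # smoothness constant of the quadratic
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)            # gradient of 0.5 * ||A x - b||^2
        x -= grad / L                       # constant step size 1/L
    return x

# Usage: the iterate should be close to the closed-form least-squares solution.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(gradient_descent(A, b) - x_star))
```

For a general L-smooth objective the same 1/L step size guarantees descent; in practice L is unknown and is replaced by a tuned constant step or a line search.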
Lecture 3: Accelerated gradient descent
Lecture 4: Projected gradient descent and Proximal gradient descent
Lecture 5: Zeroth-order optimization
Lecture 6: Stochastic gradient descent
Lecture 7: Stochastic gradient descent: sampling strategy and stability (a minimal sampling sketch follows the reading list)
- Notes: Sampling and Stability [Notes_GH][Notes_BD]
- Part I: SGD with finite sample size; importance sampling; random reshuffling [Slides_GH] [Slides_BD]
- Part II: GD stability; SGD stability; Sharpness-aware minimization [Slides_GH] [Slides_BD]
- Homework 7: [Homework_GH] [Homework_BD]
- Reading:
- Kun Yuan et al., Stochastic gradient descent with finite samples sizes, IEEE Workshop on Machine Learning for Signal Processing, 2016
- Peilin Zhao and Tong Zhang, Stochastic Optimization with Importance Sampling for Regularized Loss Minimization, ICML, 2015.
- Bicheng Ying et al., Stochastic Learning under Random Reshuffling with Constant Step-sizes, IEEE Transactions on Signal Processing, 2018
- Lei Wu et al., How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective, NeurIPS 2018
- Lei Wu et al., The alignment property of sgd noise and how it helps select flat minima: A stability analysis, NeurIPS 2022
- Pierre Foret et al., Sharpness-Aware Minimization for Efficiently Improving Generalization, ICLR, 2021
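As noted under the Lecture 7 heading, here is a minimal Python sketch contrasting i.i.d. sampling with replacement and random reshuffling for SGD on a least-squares toy problem. The objective, step size, and function names are illustrative assumptions, not course code:

```python
# SGD on f(x) = (1/2n) * ||A x - b||^2 using either i.i.d. sampling with
# replacement or random reshuffling (a fresh permutation every epoch).
import numpy as np

def sgd(A, b, num_epochs=50, lr=0.01, reshuffle=True, seed=0):
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(num_epochs):
        # Reshuffling visits every sample exactly once per epoch;
        # the alternative draws n indices uniformly with replacement.
        idx = rng.permutation(n) if reshuffle else rng.integers(0, n, size=n)
        for i in idx:
            grad_i = A[i] * (A[i] @ x - b[i])   # per-sample gradient
            x -= lr * grad_i
    return x

# Usage: compare the final distance to the least-squares solution.
rng = np.random.default_rng(1)
A, b = rng.standard_normal((200, 5)), rng.standard_normal(200)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
for flag in (False, True):
    err = np.linalg.norm(sgd(A, b, reshuffle=flag) - x_star)
    print(f"reshuffle={flag}: distance to x* = {err:.4f}")
```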
Lecture 8: Momentum and Adaptive SGD (a minimal Adam sketch follows the reading list)
- Notes I: Momentum SGD [Notes_GH][Notes_BD]
- Notes II: Adaptive SGD [Notes_GH][Notes_BD]
- Part I: Momentum SGD; SGD with Nesterov momentum; lower bound in stochastic optimization [Slides_GH] [Slides_BD]
- Part II: Preconditioned SGD; AdaGrad; RMSProp; Adam; AdamW [Slides_GH] [Slides_BD]
- Homework: No homework
- Reading:
- Ilya Sutskever et al., On the importance of initialization and momentum in deep learning, ICML, 2013
- Kun Yuan et al., On the influence of momentum acceleration on online learning, JMLR, 2016
- John Duchi et al., Adaptive subgradient methods for online learning and stochastic optimization, JMLR, 2011
- Diederik P. Kingma et al., Adam: A Method for Stochastic Optimization, 2014
- Zhishuai Guo et al., A novel convergence analysis for algorithms of the Adam family, 2021
- Ilya Loshchilov et al., Decoupled Weight Decay Regularization, ICLR, 2019
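A minimal Python sketch of the Adam update discussed in Part II, following the moment estimates and bias correction of Kingma et al. above. The badly scaled quadratic test function and the hyperparameters are illustrative defaults, not course code:

```python
# Adam: exponential moving averages of the gradient and its square,
# bias-corrected, then a coordinate-wise preconditioned step.
import numpy as np

def adam(grad_fn, x0, lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    x = x0.astype(float)
    m = np.zeros_like(x)    # first-moment (momentum) estimate
    v = np.zeros_like(x)    # second-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)           # bias correction
        v_hat = v / (1 - beta2**t)
        x -= lr * m_hat / (np.sqrt(v_hat) + eps)
    return x

# Usage: a badly scaled quadratic, where Adam's coordinate-wise scaling helps.
scales = np.array([1.0, 100.0])
grad = lambda x: scales * x                  # gradient of 0.5 * sum(scales * x^2)
print(adam(grad, np.array([5.0, 5.0])))      # should approach the minimizer at 0
```

Applying a decoupled weight-decay term directly in the update, rather than adding L2 regularization to the gradient, gives the AdamW variant of Loshchilov et al. listed above.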
Lecture 9: Variance reduction
Lecture 10-1: Adversarial learning
Lecture 10-2: Gradient clipping
Lecture 11: Mixed-Precision Training
Lecture 12: Meta Learning (a minimal meta-learning sketch follows the reading list)
- Part I: Introduction to Meta Learning (we will use the excellent slides from Prof. Hung-Yi Lee)
- Part II: Learning to Initialize; MAML; Reptile [Slides_GH][Slides_BD]
- Part III: Learning to Optimize [Slides_GH][Slides_BD]
- Reading:
- Chelsea Finn et al., Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, ICML 2017
- Antreas Antoniou et al., How to train your MAML, ICLR 2019.
- Alex Nichol et al., On First-Order Meta-Learning Algorithms, 2018.
- Marcin Andrychowicz et al., Learning to learn by gradient descent by gradient descent, NIPS 2016.
- Karol Gregor et al., Learning Fast Approximations of Sparse Coding, ICML 2010.
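A minimal Python sketch of the learn-to-initialize idea, in the spirit of the Reptile algorithm of Nichol et al. above. The toy task family (quadratics with task-specific centers), the learning rates, and the function names are illustrative assumptions:

```python
# Reptile-style meta-learning: adapt to a sampled task with a few inner
# gradient steps, then nudge the shared initialization toward the adapted weights.
import numpy as np

def inner_adapt(x, c, inner_lr=0.1, inner_steps=5):
    # A few gradient steps on one task f_c(x) = 0.5 * ||x - c||^2.
    for _ in range(inner_steps):
        x = x - inner_lr * (x - c)           # gradient of f_c is x - c
    return x

def reptile(task_centers, meta_lr=0.1, meta_iters=200, seed=0):
    rng = np.random.default_rng(seed)
    init = np.zeros(2)
    for _ in range(meta_iters):
        c = task_centers[rng.integers(len(task_centers))]   # sample a task
        adapted = inner_adapt(init, c)
        init = init + meta_lr * (adapted - init)            # outer (meta) update
    return init

# Usage: the learned initialization ends up near the center of the task optima,
# so a few inner steps adapt it quickly to any single task.
tasks = [np.array([1.0, 2.0]), np.array([3.0, -1.0]), np.array([-2.0, 0.5])]
print(reptile(tasks))
```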
Lecture 13: Decentralized Learning (a minimal DGD sketch follows the reading list)
- Part I: Introduction to distributed learning [Slides_GH][Slides_BD]
- Part II: Decentralized communication; Average consensus; Dynamic average consensus [Slides_GH][Slides_BD]
- Part III: DGD; Diffusion; EXTRA; Exact-Diffusion; Gradient-tracking [Slides_GH][Slides_BD]
- Part IV: Transient stage; Stochastic decentralized algorithms [Slides_GH][Slides_BD]
- Part V: Exponential graph; One-peer exponential graph; EquiTopo graph [Slides_GH][Slides_BD]
- Part VI: BlueFog library [Slides_GH][Slides_BD]
- Reading:
- Kun Yuan et al., On the convergence of decentralized gradient descent, SIAM Journal on Optimization, 2016.
- Wei Shi et al., EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization, SIAM Journal on Optimization, 2015.
- Angelia Nedich et al., Achieving Geometric Convergence for Distributed Optimization over Time-Varying Graphs, SIAM Journal on Optimization, 2017.
- Anastasia Koloskova et al., A Unified Theory of Decentralized SGD with Changing Topology and Local Updates, ICML 2020.
- Kun Yuan et al., Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD, JMLR 2023.
- Bicheng Ying et al., Exponential Graph is Provably Efficient for Decentralized Deep Training, NeurIPS 2021.
- Bicheng Ying et al., BlueFog: Make Decentralized Algorithms Practical for Optimization and Deep Learning, 2021.
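A minimal Python sketch of decentralized gradient descent (DGD) from Part III: each node mixes its iterate with its neighbors through a doubly stochastic matrix and then takes a local gradient step. The ring topology, mixing weights, local quadratic objectives, and step size are illustrative assumptions, not course code:

```python
# DGD on n nodes: node i holds f_i(x) = 0.5 * ||x - c_i||^2 and only
# communicates with its two neighbors on a ring.
import numpy as np

def ring_mixing_matrix(n):
    # Doubly stochastic weights: 1/3 to self and to each ring neighbor.
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

def dgd(centers, alpha=0.05, num_iters=500):
    n, d = centers.shape
    W = ring_mixing_matrix(n)
    X = np.zeros((n, d))                 # row i is node i's local iterate
    for _ in range(num_iters):
        grads = X - centers              # local gradients of 0.5 * ||x - c_i||^2
        X = W @ X - alpha * grads        # mix with neighbors, then descend
    return X

# Usage: with a constant step size the local iterates cluster in an
# O(alpha) neighborhood of the global minimizer, the average of the c_i.
centers = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [2.0, -1.0]])
print(dgd(centers))
print(centers.mean(axis=0))
```

Bias-corrected methods such as EXTRA, Exact-Diffusion, and gradient tracking (listed in the readings above) remove this constant-step-size bias.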
Lecture 14: Federated Learning