PKU Class 2024 Fall: Optimization for Deep Learning
Instructor: Kun Yuan (kunyuan@pku.edu.cn)
Teaching assistants:
- Boao Kong (kongboao@stu.pku.edu.cn)
- Ziling Lin (linzl@stu.pku.edu.cn)
- Yuxi Liu (yuxiliu666@stu.pku.edu.cn)
- Yongqi Qiao (miraclefalls@stu.pku.edu.cn)
- Yilong Song (2301213059@pku.edu.cn)
- Baihao Wu (baihaowu@pku.edu.cn)
- Guangzhengao Yang (gzayang@stu.pku.edu.cn)
- Yuan Zhang (zy1002@stu.pku.edu.cn)
Classroom: 3pm - 6pm Tuesday, Teaching Building No. 3 (三教), Room 403
Office hours: 4pm - 5pm Thursday, Jingyuan Courtyard 6 (静园六院), Room 220
References
Martin Jaggi and Nicolas Flammarion, Optimization for Machine Learning, EPFL Class CS-439
Chris De Sa, Advanced Machine Learning Systems, Cornell CS6787
Kun Yuan, Introduction to LLM, PKU
Materials
Lecture 1: Introduction
Lecture 2: Basics in Machine Learning and Language Models
- Part I: Basics in Machine Learning [Slides]
- Part II: Basics in Language Models [Slides]
Lecture 4: Parameters and Computations in Decoder-only LLMs
- Parameter and computation analysis [Slides]
- Memory analysis (please read after Lecture 8) [Slides]
Lecture 5: Stochastic Gradient Descent
Lecture 6: Advances in SGD
Lecture 7: Momentum SGD
Lecture 8: Adaptive SGD
Midterm Exam: Good Luck :)
Lecture 9: Mixed-Precision Training
Lecture 10: Block Coordinate Descent
- Block coordinate descent; Coordinate friendly structure [Slides]
- Block-wise training in LLMs [Slides]
Lecture 11: Subspace Optimization
- Subspace optimization; GaLore; GoLore [Slides]
Lecture 12: Zeroth-order Optimization
- Finite difference; linear interpolation; sphere smoothing [Slides]
- MeZO; LOZO [Slides]
Lecture 13: Distributed Optimization
- Data Parallelism; Pipeline Parallelism; Tensor Parallelism [Slides]
- Decentralized Learning [Slides]
- Communication Compression; Local Learning [Slides]
Lecture 14: Flash Attention
- Memory access cost; Kernel fusion; Flash attention [Slides]