PKU Class 2026 Spring: Introduction to Foundation Models
Instructor: Kun Yuan (kunyuan@pku.edu.cn)
Teaching assistants:
- Daibo Li (2501210088@stu.pku.edu.cn)
- Yilong Song (2301213059@pku.edu.cn)
- Zhoutong Wu (2501111519@stu.pku.edu.cn)
Class time and location: 6:30pm - 8:30pm Tuesday, 1:00pm - 3:00pm Thursday, Teaching Building 3 (三教), Room 506
Office hours: 4pm - 5pm Wednesday, Jingyuan Courtyard 6 (静园六院), Room 220
References
Stanford CS224n: Natural Language Processing with Deep Learning
Projects
Course Projects
Lectures
Lecture 1: Introduction to LLM
- Introduction to deep learning [Slides1]
- Introduction to large language model [Slides2]
- Reading:
Lecture 2: Machine Learning Basics
- Preliminary [Notes]
- Linear regression; Logistic regression; Multi-class classification; Neural network [Slides]
- Reading:
Lecture 3: Language Models
- Word embedding; Language models; Recurrent neural networks [Slides]
- Seq2Seq; Attention; Transformer [Slides]
- Reading:
Lecture 4: Parameters, Computations, and Memories in Language Models
Lecture 5: Popular LLM Models
- Teacher forcing; Pretraining and fine-tuning; BERT; GPTs [Slides]
- DeepSeek [Slides]
- Reading:
Lecture 6: Gradient Descent
Lecture 7: Accelerated Gradient Descent
- Momentum gradient descent; Nesterov acceleration; Anderson acceleration [Slides] [Notes]
Lecture 8: Stochastic Gradient Descent
Lecture 9: Momentum and Adaptive SGD
Lecture 10: Flash Attention
- Memory access cost; Kernel fusion; Flash attention [Slides]
Lecture 11: Mixed-Precision
- FP32; FP16; Mixed-precision training [Slides]
Lecture 12: 3D Parallelism
- Data Parallelism; Pipeline Parallelism; Tensor Parallelism [Slides]
Lecture 13: Agents (by TA)
Lecture 14: Scaling Law
Lecture 15: Principles in Prompt Engineering
Lecture 16: Data Preparation
- Data source; Deduplication; Quality filtering; Sensitive information reduction; Data composition; Data curriculum [Slides]
- Reading:
- Penedo, Guilherme, et al., The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
- Soldaini, Luca, et al., Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
- Kandpal, Nikhil, Eric Wallace, and Colin Raffel, Deduplicating Training Data Mitigates Privacy Risks in Language Models
- Xie, Sang Michael, et al., Data Selection for Language Models via Importance Resampling
- Chen, Mayee, et al., Skill-it! A data-driven skills framework for understanding and training language models
Lecture 17: Parameter-Efficient Fine-Tuning
- Low-rank adaptation (LoRA); LoRA+; DoRA; LISA; BAdam [Slides]
- Reading:
- E. Hu, et al., LoRA: Low-Rank Adaptation of Large Language Models
- S. Hayou, et al., LoRA+: Efficient Low Rank Adaptation of Large Models
- S. Liu, et al., DoRA: Weight-Decomposed Low-Rank Adaptation
- R. Pan, et al., LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
- Q. Luo, et al., BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models
Lecture 18: Inference
- KV Cache; MLA; H2O; StreamingLLM; Quest; Speculative decoding; PagedAttention [Slides]
- Reading:
Lecture 19: Alignment
- SFT and RLHF [Slides]
- Proximal policy optimization (PPO) (Slides are adapted from Hung-yi Lee’s lecture on Bilibili)
- Reasoning Large Language Models; DeepSeek-R1 (Slides are adapted from Hung-yi Lee’s lecture on Bilibili)
- Reading:
Lecture 20: Guest Lecture
- Leheng Chen: Harness Engineering [Slides]