PKU Class 2025 Spring: Introduction to Foundation Models
Instructor: Kun Yuan (kunyuan@pku.edu.cn)
Teaching assistants:
- Jie Hu (hujie@stu.pku.edu.cn)
- Yipeng Hu (2301213082@stu.pku.edu.cn)
- Qiulin Shang (2100013145@stu.pku.edu.cn)
- Yilong Song (2301213059@pku.edu.cn)
Office hour: 4pm - 5pm Wednesday, 静园六院220
References
Stanford CS224n: Natural Language Processing with Deep Learning
Lectures
Lecture 1: Introduction to LLMs
Lecture 2: Basics in machine learning
- Warm-up: Preliminaries [Notes]
- Linear regression; Logistic regression; Multi-classification; Neural network [Slides]
- Reading:
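To make the logistic-regression topic above concrete, here is a minimal, self-contained sketch (not taken from the course materials; the toy data, step size, and iteration count are illustrative assumptions) of binary logistic regression trained with gradient descent:

```python
import numpy as np

# Toy binary-classification data (hypothetical: 100 samples, 2 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)    # labels in {0, 1}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.1

for step in range(500):
    p = sigmoid(X @ w + b)                   # predicted probability of class 1
    grad_w = X.T @ (p - y) / len(y)          # gradient of the mean cross-entropy loss w.r.t. w
    grad_b = np.mean(p - y)                  # ... and w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

acc = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1.0))
print("training accuracy:", acc)
```

The same loop with a softmax output and one-hot labels gives the multi-class version covered in the slides.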
Lecture 3: Gradient descent
- Convex set; Convex functions; Convex problems; Gradient descent [Slides] [Notes]
- Forward-backward propagation [Notes]
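A minimal gradient-descent sketch on a smooth convex least-squares problem, matching the lecture's setting (purely illustrative; the problem sizes and the step size 0.1 are assumptions, and in general the step size must be chosen relative to the smoothness constant):

```python
import numpy as np

# Least-squares objective f(w) = (1/2n) * ||X w - y||^2, a smooth convex problem.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=200)

def grad(w):
    return X.T @ (X @ w - y) / len(y)        # gradient of the objective at w

w = np.zeros(5)
lr = 0.1                                     # fixed step size (assumed small enough)
for t in range(1000):
    w = w - lr * grad(w)                     # the gradient-descent iteration

print("distance to w_true:", np.linalg.norm(w - w_true))
```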
Lecture 4: Stochastic gradient descent
- Stochastic gradient descent (SGD); mini-batch SGD [Slides] [Notes]
- Mini-batch forward-backward propagation [Slides]
- Reading:
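A minimal mini-batch SGD sketch on the same kind of least-squares toy problem (illustrative only; the batch size, step size, and epoch count are assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(5)
lr, batch_size = 0.05, 32

for epoch in range(20):
    perm = rng.permutation(len(y))           # reshuffle the data each epoch
    for start in range(0, len(y), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        g = Xb.T @ (Xb @ w - yb) / len(idx)  # stochastic gradient from one mini-batch
        w -= lr * g

print("distance to w_true:", np.linalg.norm(w - w_true))
```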
Lecture 5: Advanced optimizers
- Momentum SGD; Nesterov SGD [Slides]
- Adaptive SGD; AdaGrad; RMSProp; Adam [Slides]
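A sketch of a single Adam update following the standard update rule with bias correction (the hyperparameter defaults are the usual ones from the Adam paper; the toy quadratic at the end is only an assumed example):

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. w: parameters, g: gradient, m/v: moment estimates, t: step count (1-based)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * g * g      # second-moment (adaptive scaling) estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Tiny usage example on f(w) = ||w||^2 / 2, whose gradient is simply w.
w = np.array([1.0, -2.0, 3.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 2001):
    w, m, v = adam_step(w, w, m, v, t, lr=1e-2)
print("w after Adam:", w)                    # should be close to zero
```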
Lecture 6: Language models
- Word embedding; Language models; Recurrent neural networks [Slides]
- Reading:
Lecture 7: Transformers
- Seq2seq models; cross-attention; self-attention; transformers [Slides]
- Reading:
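A minimal single-head scaled dot-product self-attention in NumPy (no masking, batching, or multi-head logic; the sizes and random projections are assumptions for illustration):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence of token vectors X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # project tokens to queries/keys/values
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # pairwise similarities, scaled
    scores -= scores.max(axis=-1, keepdims=True)      # subtract row max for numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)          # softmax over keys
    return attn @ V                                   # weighted sum of values

# Hypothetical sizes: 4 tokens, model dimension 8, head dimension 8.
rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (4, 8)
```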
Lecture 8: Flash Attention
- Memory access cost; Kernel fusion; Flash attention [Slides]
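FlashAttention rests on the fact that softmax can be computed in a single streaming pass by tracking a running maximum and normalizer, which is what allows attention to be tiled in fast on-chip memory without materializing the full score matrix. A small illustration of that online-softmax idea (pure NumPy, not the actual kernel; the split into four blocks is arbitrary):

```python
import numpy as np

def online_softmax(scores):
    """Streaming softmax over a 1-D array, visiting the scores one block at a time."""
    m = -np.inf          # running maximum
    s = 0.0              # running normalizer: sum of exp(x - m) seen so far
    for chunk in np.array_split(scores, 4):
        m_new = max(m, chunk.max())
        s = s * np.exp(m - m_new) + np.exp(chunk - m_new).sum()  # rescale old sum, add new block
        m = m_new
    return np.exp(scores - m) / s             # final probabilities (for checking only)

x = np.random.default_rng(4).normal(size=16)
ref = np.exp(x - x.max()); ref /= ref.sum()
print(np.allclose(online_softmax(x), ref))    # True
```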
Lecture 9: Midterm Review
Lecture 10: Mixed-Precision
- FP32; FP16; Mixed-precision training [Slides]
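Mixed-precision training keeps master weights in FP32 and scales the loss so that small FP16 gradients do not underflow. A toy NumPy illustration of the underflow problem and the loss-scaling fix (the gradient value and the scale 65536 are assumed for illustration):

```python
import numpy as np

grad_fp32 = np.float32(1e-8)                  # a tiny but nonzero "true" gradient

# Naive cast: 1e-8 is well below FP16's smallest positive value (~6e-8), so it rounds to zero.
print(np.float16(grad_fp32))                  # 0.0 -> the update would be lost

# Loss scaling: scale the loss (hence all gradients) up before the FP16 cast,
# then unscale in FP32 when applying the optimizer update.
scale = np.float32(65536.0)
grad_fp16 = np.float16(grad_fp32 * scale)     # now comfortably representable in FP16
recovered = np.float32(grad_fp16) / scale     # unscale back in FP32
print(recovered)                              # approximately 1e-8 again
```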
Lecture 11: 3D Parallelism
- Data Parallelism; Pipeline Parallelism; Tensor Parallelism [Slides]
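Data parallelism keeps an identical model replica on every worker, feeds each worker a different shard of the mini-batch, and averages the local gradients (an all-reduce) before each update. A toy single-process simulation in NumPy (four simulated workers, no real communication; sizes and step counts are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(256, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=256)

n_workers, lr = 4, 0.1
w = np.zeros(5)                                # every worker holds an identical replica of w

for step in range(200):
    batch = rng.choice(len(y), size=64, replace=False)
    shards = np.array_split(batch, n_workers)  # each worker receives one shard of the batch
    local_grads = []
    for shard in shards:
        Xs, ys = X[shard], y[shard]
        local_grads.append(Xs.T @ (Xs @ w - ys) / len(shard))  # gradient on the local shard
    g = np.mean(local_grads, axis=0)           # "all-reduce": average gradients across workers
    w -= lr * g                                # every replica applies the same update

print("distance to w_true:", np.linalg.norm(w - w_true))
```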
Lecture 12: Popular LLM Models
- Teacher forcing; Pretrain and Finetuning; BERT; GPTs [Slides]
- DeepSeek [Slides]
- Reading:
Lecture 13: Scaling Law
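For reference, the parametric loss model popularized by the Chinchilla study (Hoffmann et al., 2022) is the form most scaling-law discussions start from; the constants are fitted to a particular family of training runs rather than universal:

```latex
% N: number of model parameters, D: number of training tokens.
% E, A, B, \alpha, \beta are constants fitted to a family of training runs.
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Under a fixed compute budget proportional to N·D, minimizing this form with the Chinchilla fits suggests growing N and D in roughly equal proportion, on the order of 20 training tokens per parameter.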
Lecture 14: Principles of Prompt Engineering
Lecture 15: Data Preparation
- Data source; Deduplication; Quality filtering; Sensitive information reduction; Data composition; Data curriculum [Slides] (a small deduplication sketch follows the reading list below)
- Reading:
- Penedo, Guilherme, et al., The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
- Soldaini, Luca, et al., Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
- Kandpal, Nikhil, Eric Wallace, and Colin Raffel, Deduplicating Training Data Mitigates Privacy Risks in Language Models
- Xie, Sang Michael, et al., Data Selection for Language Models via Importance Resampling
- Chen, Mayee, et al., Skill-it! A Data-Driven Skills Framework for Understanding and Training Language Models
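As a companion to the deduplication topic above, here is a minimal exact-deduplication sketch that hashes normalized documents and keeps the first occurrence of each; production pipelines such as those in the readings additionally use fuzzy methods like MinHash, which this sketch does not attempt (the corpus and normalization rule are assumptions):

```python
import hashlib

def normalize(text):
    """Cheap normalization before hashing: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

def exact_dedup(docs):
    """Keep the first occurrence of each distinct (normalized) document."""
    seen, kept = set(), []
    for doc in docs:
        h = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept

corpus = ["The cat sat.", "the  cat sat.", "A different document."]
print(exact_dedup(corpus))    # the near-identical second document is dropped
```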
Lecture 16: LLM Based Agents
Lecture 17: Parameter-Efficient Fine-Tuning
- Low-Rank Adaptation (LoRA); LoRA+; DoRA; LISA; BAdam [Slides] (a minimal LoRA sketch follows the reading list below)
- Reading:
- E. Hu et al., LoRA: Low-Rank Adaptation of Large Language Models
- S. Hayou et al., LoRA+: Efficient Low Rank Adaptation of Large Models
- S. Liu et al., DoRA: Weight-Decomposed Low-Rank Adaptation
- R. Pan et al., LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
- Q. Luo et al., BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models
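A minimal LoRA forward pass in NumPy, following the standard parameterization (a frozen weight W plus a trainable low-rank update scaled by alpha / r, with A initialized randomly and B at zero so training starts from the pretrained behavior); the layer sizes, rank, and the class name LoRALinear are assumptions for illustration, not the course's reference implementation:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""
    def __init__(self, W, r=8, alpha=16, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                        # pretrained weight, kept frozen
        self.A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
        self.B = np.zeros((d_out, r))                     # trainable, zero init -> no change at start
        self.scale = alpha / r

    def forward(self, x):
        # Equivalent to x @ (W + scale * B @ A).T, but without ever modifying W.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

# Hypothetical usage: a 64x64 pretrained layer adapted with rank 8.
W = np.random.default_rng(6).normal(size=(64, 64))
layer = LoRALinear(W)
x = np.ones((2, 64))
print(layer.forward(x).shape)      # (2, 64); identical to x @ W.T until B is trained
```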
Lecture 18: Inference
- KV cache; MLA; H2O; StreamingLLM; Quest; Speculative decoding; PagedAttention [Slides] (a minimal KV-cache sketch follows below)
- Reading:
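A toy single-head decoding loop with a KV cache in NumPy: each step appends the new token's key and value once and reuses the cached ones instead of recomputing them for the whole prefix. The projections, dimensions, and the stand-in "decoder layer" are assumptions for illustration:

```python
import numpy as np

def attend(q, K, V):
    """Attention of a single query vector against all cached keys/values."""
    scores = K @ q / np.sqrt(len(q))
    scores -= scores.max()                     # numerical stability
    p = np.exp(scores); p /= p.sum()
    return p @ V

rng = np.random.default_rng(7)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
K_cache = np.empty((0, d))
V_cache = np.empty((0, d))

x = rng.normal(size=d)                         # embedding of the current token
for step in range(5):
    K_cache = np.vstack([K_cache, Wk @ x])     # append this token's key/value once...
    V_cache = np.vstack([V_cache, Wv @ x])
    out = attend(Wq @ x, K_cache, V_cache)     # ...and reuse all previous ones each step
    x = out                                    # stand-in for the rest of the decoder layer
print(K_cache.shape)                           # (5, 16): one cached key per generated token
```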
Lecture 19: Alignment
- SFT and RLHF [Slides]
- Proximal policy optimization (PPO) (slides adapted from Hung-yi Lee’s lecture on Bilibili); the clipped PPO objective is reproduced below
- Reasoning Large Language Models; DeepSeek-R1 (slides adapted from Hung-yi Lee’s lecture on Bilibili)
- Reading:
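For reference, the clipped surrogate objective from the PPO paper (Schulman et al., 2017), which the RLHF portion of this lecture builds on:

```latex
% r_t(\theta) is the probability ratio between the new and old policy;
% \hat{A}_t is the advantage estimate; \epsilon is the clipping range (e.g. 0.2).
L^{\mathrm{CLIP}}(\theta)
  = \mathbb{E}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\;
      \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \big) \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

In RLHF the advantage estimates are derived from a learned reward (and value) model, and a KL penalty against the SFT policy is typically added to keep the fine-tuned model close to its starting point.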