Kun Yuan

PKU Class 2026 Spring: Introduction to Foundation Models

Instructor: Kun Yuan (kunyuan@pku.edu.cn)

Teaching assistants:

Daibo Li (2501210088@stu.pku.edu.cn)
Yilong Song (2301213059@pku.edu.cn)
Zhoutong Wu (2501111519@stu.pku.edu.cn)

Classroom: 6:30pm - 8:30pm Tuesday, 1:00pm - 3:00pm Thursday, Room 506, Teaching Building No. 3 (三教506)

Office hours: 4pm - 5pm Wednesday, Room 220, Jingyuan Courtyard 6 (静园六院220)

References

Stanford CS224n: Natural Language Processing with Deep Learning

Projects

Course Projects

Lectures

Lecture 1: Introduction to LLMs

Introduction to deep learning [Slides1]
Introduction to large language models [Slides2]
Reading:
- Andrej Karpathy, State of GPT
- Andrej Karpathy, The busy person’s intro to LLMs

Lecture 2: Machine Learning Basics

Preliminary [Notes]
Linear regression; Logistic regression; Multi-class classification; Neural networks [Slides]
Reading:
- Stanford CS231n, Linear classification
- Stanford CS231n, Neural network part I

Lecture 3: Language Models

Word embedding; Language models; Recurrent neural networks [Slides]
Seq2Seq; Attention; Transformer [Slides]
Reading:
- Stanford CS224N: Week 4
- Ilya Sutskever et al., Sequence to Sequence Learning with Neural Networks
- Ashish Vaswani et al., Attention Is All You Need

Lecture 4: Parameters, Computations, and Memories in Language Models

Parameters, Computations, and Memory Costs in Transformers [Slides01] [Slides02]

Lecture 5: Popular LLM Models

Teacher forcing; Pretraining and finetuning; BERT; GPTs [Slides]
DeepSeek [Slides]
Reading:
- J. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- A. Radford et al., Improving Language Understanding by Generative Pre-Training
- A. Radford et al., Language Models are Unsupervised Multitask Learners
- T. B. Brown et al., Language Models are Few-Shot Learners
- DeepSeek, DeepSeek-V3 Technical Report

Lecture 6: Gradient Descent

Convex set; Convex functions; Convex problems; Gradient descent [Slides] [Notes]
Forward-backward propagation [Slides] [Notes]

Lecture 7: Accelerated Gradient Descent

Momentum gradient descent; Nesterov acceleration; Anderson acceleration [Slides] [Notes]

Lecture 8: Stochastic Gradient Descent

Stochastic gradient descent [Slides] [Notes]

Lecture 9: Momentum and Adaptive SGD

Momentum SGD [Slides] [Notes]
AdaGrad; RMSProp; Adam [Slides]

Lecture 10: Flash Attention

Memory access cost; Kernel fusion; Flash attention [Slides]

Lecture 11: Mixed-Precision

FP32; FP16; Mixed-precision training [Slides]

Lecture 12: 3D Parallelism

Data Parallelism; Pipeline Parallelism; Tensor Parallelism [Slides]

Lecture 13: Agents (by TA)

From LLMs to Agents [Slides]
Introduction to Claude Code [Slides]
Building Agents with Claude Code [Slides]

Lecture 19: Alignment

SFT and RLHF [Slides]
Proximal policy optimization (PPO) (Slides are adapted from Hung-yi Lee’s lecture on Bilibili)
Reasoning Large Language Models; DeepSeek-R1 (Slides are adapted from Hung-yi Lee’s lecture on Bilibili)
Reading:
- Andrej Karpathy, State of GPT
- L. Ouyang et al., Training language models to follow instructions with human feedback
- Jan Leike, Aligning language models with humans
- Devon Wood-Thomas, AI Alignment and LLMs
- DeepSeek Team, DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Lecture 20: Guest Lecture

Leheng Chen: Harness Engineering [Slides]