Kun Yuan

PKU Class 2024 Spring: Introduction to Large Language Models

Instructor: Kun Yuan (kunyuan@pku.edu.cn)

Sponsor: Decision Intelligence Team, Alibaba DAMO Academy

Teaching assistants:

Yudong Bai (yutonghe@pku.edu.cn)
Yunteng Geng (2301213081@pku.edu.cn)
Yutong He (yutonghe@pku.edu.cn)
Peijin Li (2301213056@stu.pku.edu.cn)
Zihao Liu (2100011704@stu.pku.edu.cn)
Keer Lu (2301213094@stu.pku.edu.cn)
Yilong Song (2301213059@pku.edu.cn)
Qianyou Sun (2301111049@stu.pku.edu.cn)
Yuchi Wang (wangyuchi@stu.pku.edu.cn)

Office hours: 4pm - 5pm Wednesday, Room 220, Jingyuan Courtyard 6 (静园六院)

References

Stanford CS224n: Natural Language Processing with Deep Learning

Lectures

Lecture 1: Introduction to LLM

Introduction to large language model [Slides]
Reading:
- Andrej Karpathy, State of GPT
- Andrej Karpathy, The busy person’s intro to LLM

Lecture 2: Linear algebra and optimization

Convex set; Convex functions; Gradient descent; Convergence [Slides]
Notes: [Linear algebra] [Gradient descent]

Lecture 3: Basics in machine learning

Linear regression; Logistic regression; Multi-classification; Neural network [Slides]
Reading:
- Stanford CS231n, Linear classification
- Stanford CS231n, Neural network part I

Lecture 4: Word embedding and language models

Word embedding [Slides]
Language models; Recurrent neural network (Slides are adapted from Stanford CS224n RNN)
Backpropagation in RNN [Slides]
Sequence-to-sequence model (Slides are adapted from Stanford CS224n Seq2Seq)

Lecture 5: Transformer

Forward-backward propagation [Hand-written materials]
Transformers (Slides are adapted from Stanford CS224n Transformers)
Parameters and Computations in Transformers [Slides]
Reading:
- Illustrated Guide to Transformers Neural Network: A step by step explanation. [YouTube video] [Bilibili video]

Guest Lecture I:

Large language model in mathematical reasoning (Dr. Jihai Zhang, Alibaba DAMO Academy)

Lecture 6: Pretrain and Fine-tune Paradigm

Teacher forcing; Pretrain; Fine-tune; BERT; GPTs [Slides]
Reading:
- J. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- A. Radford et al., Improving Language Understanding by Generative Pre-Training
- A. Radford et al., Language Models are Unsupervised Multitask Learners
- T. B. Brown et al., Language Models are Few-Shot Learners

Lecture 7: Optimizers

Stochastic gradient descent [Slides] [Notes]
Momentum SGD [Slides]
Adagrad; RMSProp; Adam [Slides]
Memories in Transformer [Slides]
Mixed-precision training [Slides]

Midterm Exam

Midterm review [Slides]

Lecture 8: Distributed Training

Scaling law [Slides]
Data parallelism and communication saving; Pipeline parallelism; Tensor parallelism [Slides]
Reading:
- Jared Kaplan et al., Scaling Laws for Neural Language Models
- Jordan Hoffmann et al., Training Compute-Optimal Large Language Models
- Tal Ben-Nun and Torsten Hoefler, Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
- Kun Yuan, Distributed Machine Learning: Part I
- Kun Yuan, Distributed Machine Learning: Part II

Lecture 9: Data Preparation

Data source; Deduplication; Quality filtering; Sensitive information reduction; Data composition; Data curriculum [Slides]
Reading:
- Penedo, Guilherme, et al., The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
- Soldaini, Luca, et al., Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
- Kandpal, Nikhil, Eric Wallace, and Colin Raffel, Deduplicating Training Data Mitigates Privacy Risks in Language Models
- Xie, Sang Michael, et al., Data Selection for Language Models via Importance Resampling
- Chen, Mayee, et al., Skill-it! A data-driven skills framework for understanding and training language models

Lecture 10: Principles in Prompt Engineering

Principles in Prompt Engineering [Slides]
Chain of Thoughts [Slides]
Reading:
- OpenAI, Prompt Engineering
- Jason Wei et al., Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Takeshi Kojima et al., Large Language Models are Zero-Shot Reasoners
- Zhuosheng Zhang et al., Automatic Chain of Thought Prompting in Large Language Models
- Shunyu Yao et al., Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Lecture 11: LLM Based Agents

LLM Based Agents [Slides]
Reading:
- Z. Xi et al., The Rise and Potential of Large Language Model Based Agents: A Survey
- Andrew Ng, Agentic Reasoning

Guest Lecture II:

Building Brainiac Buddy with LLM Agents (Zihao Liu, Beijing International Center for Mathematical Research)

Guest Lecture III:

Building LLM Agents with Alibaba MindOpt Studio (Dr. Jianfeng Yang, Alibaba DAMO Academy)

Lecture 12: Retrieval Augmented Generation

Slides are adapted from RAG from Scratch
Reading:
- LangChain, RAG from Scratch
- P. Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
- Y. Gao et al., Retrieval-Augmented Generation for Large Language Models: A Survey

Lecture 13: Parameter-Efficient Fine-Tuning

Low-Rank adaptation (LoRA); LoRA+; DoRA; LISA [Slides]
Reading:
- E. Hu et al., LoRA: Low-Rank Adaptation of Large Language Models
- S. Hayou et al., LoRA+: Efficient Low Rank Adaptation of Large Models
- S. Liu et al., DoRA: Weight-Decomposed Low-Rank Adaptation
- R. Pan et al., LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning

Lecture 14: Alignment

SFT and RLHF [Slides]
Reading:
- Andrej Karpathy, State of GPT
- L. Ouyang et al., Training language models to follow instructions with human feedback
- Jan Leike, Aligning language models with humans
- Devon Wood-Thomas, AI Alignment and LLMs

Final Review

Final review [Slides]