Course schedule

| Date | Topic | Detail | Presenter |
|------|-------|--------|-----------|
| 1/13 | Introduction | slides | Dr. Yang Wang |
| 1/15 | Framework | TensorFlow: A System for Large-Scale Machine Learning (OSDI 16) | Shuzhan Yang |
| 1/15 | Framework | PyTorch: An Imperative Style, High-Performance Deep Learning Library (NeurIPS 19) | Oliver Proudfoot |
| 1/20 | Framework | Ray: A Distributed Framework for Emerging AI Applications (OSDI 18) | Jintong Liu |
| 1/20 | Parallelism | Scaling Distributed Machine Learning with the Parameter Server (OSDI 14) | Goutham Kuncham |
| 1/22 | Parallelism | Horovod: Fast and Easy Distributed Deep Learning in TensorFlow | Kailun Lin |
| 1/22 | Parallelism | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | Andy Wu |
| 1/27 | Snow day | | |
| 1/29 | Transformer | Overview of the Transformer model | Dr. Andrew Perrault |
| 2/3 | Parallelism | PipeDream: Generalized Pipeline Parallelism for DNN Training (SOSP 19) | Qifan Yang |
| 2/3 | Parallelism | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI 22) | Siyuan Zhang |
| 2/5 | Memory | Training Deep Nets with Sublinear Memory Cost | Yang Wang |
| 2/5 | Memory | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (SC 20) | Sungjae Lee |
| 2/10 | Memory | ZeRO-Offload: Democratizing Billion-Scale Model Training (USENIX ATC 21) | William Cheng |
| 2/10 | Memory | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS 22) | Yingtie Lei |
| 2/12 | Compiler | TVM: An Automated End-to-End Optimizing Compiler for Deep Learning (OSDI 18) | Siyuan Zhang |
| 2/12 | Compiler | Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations (MAPL 19) | Iris Kuo |
| 2/17 | Compiler | TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions (SOSP 19) | Andy Wu |
| 2/17 | Compiler | TensorIR: An Abstraction for Automatic Tensorized Program Optimization (ASPLOS 23) | Iris Kuo |
| 2/19 | Checkpoint | Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models (NSDI 22) | Nick Cliffel |
| 2/19 | Checkpoint | GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP 23) | Dylan Tan |
| 2/24 | Fault Tolerance | Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (SOSP 23) | Yingtie Lei |
| 2/24 | Fault Tolerance | ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation (SOSP 24) | Jintong Liu |
| 2/26 | Model Search | EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (ICML 19) | Qifan Yang |
| 2/26 | Model Search | Once-for-All: Train One Network and Specialize It for Efficient Deployment (ICLR 20) | Fangxun Liu |
| 3/3 | Quantization | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (NeurIPS 22) | Hojin Yoo |
| 3/3 | Quantization | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (ICLR 23) | Abeer Alshehri |
| 3/5 | Cluster | Gandiva: Introspective Cluster Scheduling for Deep Learning (OSDI 18) | William Cheng |
| 3/5 | Cluster | Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (USENIX ATC 19) | Goutham Kuncham |
| 3/10 | Cluster | Tiresias: A GPU Cluster Manager for Distributed Deep Learning (NSDI 19) | Yao Lu |
| 3/10 | Cluster | Themis: Fair and Efficient GPU Cluster Scheduling (NSDI 20) | Srinivasan Subramaniyan |
| 3/12 | Cluster | AntMan: Dynamic Scaling on GPU Clusters for Deep Learning (OSDI 20) | Chuyang Chen |
| 3/12 | Cluster | Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning (OSDI 21) | Chuyang Chen |
| 3/17 | Spring break | | |
| 3/19 | Spring break | | |
| 3/24 | Cluster | MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters (NSDI 22) | Srinivasan Subramaniyan |
| 3/24 | Cluster | MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale (OSDI 24) | Dylan Tan |
| 3/26 | Inference | TensorFlow-Serving: Flexible, High-Performance ML Serving | Abeer Alshehri |
| 3/26 | Inference | Serving DNNs like Clockwork: Performance Predictability from the Bottom Up (OSDI 20) | Yao Lu |
| 3/31 | Inference | DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale (SC 22) | Fangxun Liu |
| 3/31 | Inference | DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (ICML 22) | Sungjae Lee |
| 4/2 | Inference | Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 22) | Yuan Ma |
| 4/2 | Inference | Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 23) | Hojin Yoo |
| 4/7 | Inference | AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 23) | Kailun Lin |
| 4/7 | Multimodal | DistMM: Accelerating Distributed Multimodal Model Training (NSDI 24) | Invited speaker: Jun Huang |
| 4/9 | Inference | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML 23) | Shuzhan Yang |
| 4/9 | Inference | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (OSDI 24) | Oliver Proudfoot |
| 4/14 | Inference | Splitwise: Efficient Generative LLM Inference Using Phase Splitting (ISCA 24) | Yuan Ma |
| 4/16 | Project Presentation | ? | ? |
| 4/21 | Project Presentation | ? | ? |
| 4/23 | Project Presentation | ? | ? |