Course schedule

| Date | Topic | Detail | Presenter |
|------|-------|--------|-----------|
| 1/13 | Introduction | slides | Dr. Yang Wang |
| 1/15 | Framework | TensorFlow: A System for Large-Scale Machine Learning (OSDI 16) | Shuzhan Yang |
| 1/15 | Framework | PyTorch: An Imperative Style, High-Performance Deep Learning Library (NeurIPS 19) | Oliver Proudfoot |
| 1/20 | Framework | Ray: A Distributed Framework for Emerging AI Applications (OSDI 18) | Jintong Liu |
| 1/20 | Parallelism | Scaling Distributed Machine Learning with the Parameter Server (OSDI 14) | Goutham Kuncham |
| 1/22 | Parallelism | Horovod: Fast and Easy Distributed Deep Learning in TensorFlow | Kailun Lin |
| 1/22 | Parallelism | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | Andy Wu |
| 1/27 | Snow day | | |
| 1/29 | Transformer | Overview of the Transformer model | Dr. Andrew Perrault |
| 2/3 | Parallelism | PipeDream: Generalized Pipeline Parallelism for DNN Training (SOSP 19) | Qifan Yang |
| 2/3 | Parallelism | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI 22) | Siyuan Zhang |
| 2/5 | Memory | Training Deep Nets with Sublinear Memory Cost | Yang Wang |
| 2/5 | Memory | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (SC 20) | Sungjae Lee |
| 2/10 | Memory | ZeRO-Offload: Democratizing Billion-Scale Model Training (USENIX ATC 21) | William Cheng |
| 2/10 | Memory | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS 22) | Yingtie Lei |
| 2/12 | Compiler | TVM: An Automated End-to-End Optimizing Compiler for Deep Learning (OSDI 18) | Siyuan Zhang |
| 2/12 | Compiler | Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations (MAPL 19) | Iris Kuo |
| 2/17 | Compiler | TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions (SOSP 19) | Andy Wu |
| 2/17 | Compiler | TensorIR: An Abstraction for Automatic Tensorized Program Optimization (ASPLOS 23) | Iris Kuo |
| 2/19 | Checkpoint | Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models (NSDI 22) | Nick Cliffel |
| 2/19 | Checkpoint | GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP 23) | Dylan Tan |
| 2/24 | Fault Tolerance | Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (SOSP 23) | Yingtie Lei |
| 2/24 | Fault Tolerance | ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation (SOSP 24) | Jintong Liu |
| 2/26 | Model Search | EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (ICML 19) | Qifan Yang |
| 2/26 | Model Search | Once-for-All: Train One Network and Specialize It for Efficient Deployment (ICLR 20) | Fangxun Liu |
| 3/3 | Quantization | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (NeurIPS 22) | Hojin Yoo |
| 3/3 | Quantization | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (ICLR 23) | Abeer Alshehri |
| 3/5 | Cluster | Gandiva: Introspective Cluster Scheduling for Deep Learning (OSDI 18) | William Cheng |
| 3/5 | Cluster | Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (USENIX ATC 19) | Goutham Kuncham |
| 3/10 | Cluster | Tiresias: A GPU Cluster Manager for Distributed Deep Learning (NSDI 19) | Yao Lu |
| 3/10 | Cluster | Themis: Fair and Efficient GPU Cluster Scheduling (NSDI 20) | Srinivasan Subramaniyan |
| 3/12 | Cluster | AntMan: Dynamic Scaling on GPU Clusters for Deep Learning (OSDI 20) | Chuyang Chen |
| 3/12 | Cluster | Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning (OSDI 21) | Chuyang Chen |
| 3/17 | Spring break | | |
| 3/19 | Spring break | | |
| 3/24 | Cluster | MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters (NSDI 22) | Srinivasan Subramaniyan |
| 3/24 | Cluster | MAST: Global Scheduling of ML Training across Geo-Distributed Datacenters at Hyperscale (OSDI 24) | Dylan Tan |
| 3/26 | Inference | TensorFlow-Serving: Flexible, High-Performance ML Serving | Abeer Alshehri |
| 3/26 | Inference | Serving DNNs like Clockwork: Performance Predictability from the Bottom Up (OSDI 20) | Yao Lu |
| 3/31 | Inference | DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale (SC 22) | Fangxun Liu |
| 3/31 | Inference | DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (ICML 22) | Sungjae Lee |
| 4/2 | Inference | Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 22) | Yuan Ma |
| 4/2 | Inference | Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 23) | Hojin Yoo |
| 4/7 | Inference | AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 23) | Kailun Lin |
| 4/7 | Multimodal | DistMM: Accelerating Distributed Multimodal Model Training (NSDI 24) | Invited speaker: Jun Huang |
| 4/9 | Inference | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML 23) | Shuzhan Yang |
| 4/9 | Inference | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving (OSDI 24) | Oliver Proudfoot |
| 4/14 | Inference | Splitwise: Efficient Generative LLM Inference Using Phase Splitting (ISCA 24) | Yuan Ma |
| 4/16 | Project Presentation | ? | ? |
| 4/21 | Project Presentation | ? | ? |
| 4/23 | Project Presentation | ? | ? |