| Date | Topic | Paper | Presenter |
|------|-------|-------|-----------|
| 1/13 | Introduction | Slides | Dr. Yang Wang |
| 1/15 | Framework | TensorFlow: A System for Large-Scale Machine Learning (OSDI 16) | Shuzhan Yang |
| 1/15 | Framework | PyTorch: An Imperative Style, High-Performance Deep Learning Library (NeurIPS 19) | Oliver Proudfoot |
| 1/20 | Framework | Ray: A Distributed Framework for Emerging AI Applications (OSDI 18) | Jintong Liu |
| 1/20 | Parallelism | Scaling Distributed Machine Learning with the Parameter Server (OSDI 14) | ? |
| 1/22 | Parallelism | Horovod: Fast and Easy Distributed Deep Learning in TensorFlow | Kailun Lin |
| 1/22 | Parallelism | Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism | Andy Wu |
| 1/27 | Parallelism | PipeDream: Generalized Pipeline Parallelism for DNN Training (SOSP 19) | ? |
| 1/27 | Parallelism | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning (OSDI 22) | ? |
| 1/29 | Transformer | Overview of the Transformer model | Dr. Andrew Perrault |
| 2/3 | Memory | Training Deep Nets with Sublinear Memory Cost | ? |
| 2/3 | Memory | ZeRO: Memory Optimizations Toward Training Trillion Parameter Models (SC 20) | ? |
| 2/5 | Memory | ZeRO-Offload: Democratizing Billion-Scale Model Training (USENIX ATC 21) | ? |
| 2/5 | Memory | FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS 22) | ? |
| 2/10 | Compiler | TVM: An Automated End-to-End Optimizing Compiler for Deep Learning (OSDI 18) | ? |
| 2/10 | Compiler | Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations (MAPL 19) | ? |
| 2/12 | Compiler | TASO: Optimizing Deep Learning Computation with Automatic Generation of Graph Substitutions (SOSP 19) | ? |
| 2/12 | Compiler | TensorIR: An Abstraction for Automatic Tensorized Program Optimization (ASPLOS 23) | ? |
| 2/17 | Checkpoint | Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models (NSDI 22) | ? |
| 2/17 | Checkpoint | GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints (SOSP 23) | ? |
| 2/19 | Fault Tolerance | Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates (SOSP 23) | ? |
| 2/19 | Fault Tolerance | ReCycle: Resilient Training of Large DNNs Using Pipeline Adaptation (SOSP 24) | ? |
| 2/24 | Model Search | EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (ICML 19) | ? |
| 2/24 | Model Search | Once-for-All: Train One Network and Specialize It for Efficient Deployment (ICLR 20) | ? |
| 2/26 | Quantization | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (NeurIPS 22) | ? |
| 2/26 | Quantization | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (ICLR 23) | ? |
| 3/3 | Cluster | Gandiva: Introspective Cluster Scheduling for Deep Learning (OSDI 18) | ? |
| 3/3 | Cluster | Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads (USENIX ATC 19) | ? |
| 3/5 | Cluster | Tiresias: A GPU Cluster Manager for Distributed Deep Learning (NSDI 19) | ? |
| 3/5 | Cluster | Themis: Fair and Efficient GPU Cluster Scheduling (NSDI 20) | ? |
| 3/10 | Cluster | AntMan: Dynamic Scaling on GPU Clusters for Deep Learning (OSDI 20) | ? |
| 3/10 | Cluster | Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning (OSDI 21) | ? |
| 3/12 | Cluster | MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters (NSDI 22) | ? |
| 3/12 | Cluster | MAST: Global Scheduling of ML Training Across Geo-Distributed Datacenters at Hyperscale (OSDI 24) | ? |
| 3/17 | Spring break | | |
| 3/19 | Spring break | | |
| 3/24 | Inference | TensorFlow-Serving: Flexible, High-Performance ML Serving | ? |
| 3/24 | Inference | Serving DNNs like Clockwork: Performance Predictability from the Bottom Up (OSDI 20) | ? |
| 3/26 | Inference | DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale (SC 22) | ? |
| 3/26 | Inference | DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (ICML 22) | ? |
| 3/31 | Inference | Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI 22) | ? |
| 3/31 | Inference | Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 23) | ? |
| 4/2 | Inference | AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI 23) | ? |
| 4/2 | Inference | FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU (ICML 23) | ? |
| 4/7 | Inference | DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized Large Language Model Serving (OSDI 24) | ? |
| 4/7 | TBD | TBD | ? |
| 4/9 | TBD | TBD | ? |
| 4/14 | TBD | TBD | ? |
| 4/16 | Project Presentation | ? | ? |
| 4/21 | Project Presentation | ? | ? |
| 4/23 | Project Presentation | ? | ? |