Deep Learning Workload Management in Large-Scale GPU Clusters

PhD Qualifying Examination


Title: "Deep Learning Workload Management in Large-Scale GPU Clusters"

by

Mr. Lingyun YANG


Abstract:

In the past decade, deep learning (DL) has advanced rapidly and achieved 
remarkable performance in a variety of application domains. 
Large tech companies build large-scale heterogeneous computing clusters 
equipped with GPUs to accelerate the development of DL models. Compared to 
high-performance computing (HPC) and big data analytics workloads, DL 
workloads exhibit distinct characteristics, such as gang scheduling 
requirements and resource heterogeneity, which bring new challenges and 
opportunities for cluster resource management. Efficiently managing DL 
workloads can improve resource utilization, lower operational costs, and 
reduce energy consumption.

This survey reviews recent research on GPU cluster management tailored 
for DL training and inference workloads. We first summarize how DL 
workloads are integrated into GPU clusters and what characteristics they 
share. We then present prior work organized by optimization goal: 
resource utilization, job efficiency, and fairness among multiple 
tenants. We hope this survey can shed light on system optimization for 
GPU cluster management and facilitate future industry-oriented designs.


Date:			Thursday, 18 August 2022

Time:			4:00pm - 6:00pm

Zoom Meeting: 
https://hkust.zoom.us/j/93975876687?pwd=d0xRcmVpYWgwTDNwQnJENGF5K0Ftdz09

Committee Members:	Dr. Wei Wang (Supervisor)
 			Prof. Kai Chen (Chairperson)
 			Prof. Bo Li
 			Prof. Qian Zhang


**** ALL are Welcome ****