OPTIMIZE RESOURCE SCHEDULING IN MULTI-TENANT CLUSTERS AT SCALE

PhD Thesis Proposal Defence


Title: "OPTIMIZE RESOURCE SCHEDULING IN MULTI-TENANT CLUSTERS AT SCALE"

by

Mr. Qizhen WENG


Abstract:

With the rise of cloud computing over the past few decades, there has been a 
trend toward employing large-scale shared clusters of commodity machines 
to serve multiple user groups. Such multi-tenant clusters are usually highly 
heterogeneous, and their workloads are widely diverse. In this dissertation, we 
aim to improve the performance of workloads and reduce the operating costs of 
clusters by optimizing resource scheduling.

In clusters for online cloud services, long-running applications (LRAs) 
deployed in containers are prevalent and have the highest priority. But placing 
LRA containers is known to be difficult: they often have sophisticated 
performance interactions (e.g., resource interference and I/O dependencies) 
that are hard for existing constraint-based schedulers to evaluate 
quantitatively. Fortunately, we find that modern reinforcement learning (RL) 
techniques offer an appealing solution for LRA scheduling. We propose Metis, a 
general-purpose RL-based scheduler that learns to optimally place LRA 
containers and scales to production clusters with hierarchical learning 
techniques.

Shared clusters running diverse workloads of Machine Learning (ML) algorithms, 
on the other hand, are usually equipped with Graphics Processing Units (GPUs) of 
different generations. However, the characteristics of such scenarios remain 
largely unexplored. We therefore present a comprehensive trace study of a 
typical ML-as-a-Service (MLaaS) cloud in the enterprise and discuss the 
scheduling opportunities and challenges with benchmarks and simulations. We not 
only show that GPU sharing and task recurrence can be leveraged to improve the 
cluster efficiency, but also reveal the presence of hard-to-schedule tasks, 
imbalanced load across heterogeneous machines, potential CPU bottlenecks, 
and so forth, calling for further resource scheduling designs.


Date:			Friday, 29 July 2022

Time:			4:00pm - 6:00pm

Zoom Meeting: 
https://hkust.zoom.us/j/91058544752?pwd=NkZhc3VUWC9hMVJPK3F5bjZmM3dtZz09

Committee Members:	Dr. Wei Wang (Supervisor)
 			Prof. Qian Zhang (Chairperson)
 			Prof. Kai Chen
 			Prof. Bo Li


**** ALL are Welcome ****