PhD Thesis Proposal Defence


Title: "Learning to Schedule Long-Running Applications in Shared Container 
Clusters"

by

Mr. Luping WANG


Abstract:

Online cloud services are increasingly deployed as long-running applications 
(LRAs) in containers. Placing LRA containers is known to be difficult, as 
they often exhibit complex resource interference and I/O dependencies. 
Existing schedulers rely on operators to manually express container 
scheduling requirements as placement constraints and strive to satisfy as 
many constraints as possible. Such schedulers, however, fall short in 
performance, as placement constraints only provide qualitative scheduling 
guidelines, and minimizing constraint violations does not necessarily yield 
optimal performance. In this work, we present Metis, a general-purpose 
scheduler that learns to optimally place LRA containers using deep 
reinforcement learning (RL) techniques. This eliminates the complex manual 
specification of placement constraints and offers, for the first time, 
concrete quantitative scheduling criteria. As directly training an RL agent 
does not scale, we develop a novel hierarchical learning technique that 
decomposes the complex container placement problem into a hierarchy of 
subproblems with significantly reduced state and action spaces. We show that 
many subproblems have similar structures and can hence be solved by training 
a unified RL agent offline. This work has been accepted at the IEEE/ACM 
International Conference for High Performance Computing, Networking, Storage 
and Analysis (SC20).
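
As a rough, hypothetical illustration of the hierarchical idea (a minimal 
sketch, not Metis's actual algorithm), the Python snippet below splits the 
cluster into node groups, picks a group first, and then lets a single shared 
low-level policy pick a node within that group, so each decision faces a much 
smaller state and action space than a flat policy over all nodes. The group 
size, the load-based scoring, and all function names are assumptions made for 
illustration only.

    import random

    # Toy sketch of hierarchical placement (hypothetical, not Metis's code).
    # A cluster of N nodes is split into groups; a high-level choice picks a
    # group, and a single shared low-level policy picks a node inside it.
    NUM_NODES = 64
    GROUP_SIZE = 8
    nodes = [{"id": i, "load": random.random()} for i in range(NUM_NODES)]
    groups = [nodes[i:i + GROUP_SIZE]
              for i in range(0, NUM_NODES, GROUP_SIZE)]

    def high_level_choose_group(groups):
        # Stand-in for the top-level decision: pick the least-loaded group.
        return min(groups, key=lambda g: sum(n["load"] for n in g))

    def low_level_choose_node(group):
        # Stand-in for the shared low-level agent: because every group-level
        # subproblem has the same structure, one policy can serve all groups.
        return min(group, key=lambda n: n["load"])

    def place_container(container_load):
        group = high_level_choose_group(groups)
        node = low_level_choose_node(group)
        node["load"] += container_load
        return node["id"]

    if __name__ == "__main__":
        for c in range(5):
            print(f"container {c} -> node {place_container(0.1)}")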

In a follow-up work, we present another scheduler, George, which achieves 
high-quality container performance subject to operational constraints. 
Specifically, we design a tailored constrained policy optimization algorithm 
that projects the performance-improving training direction onto a safe zone 
in which the operational constraints are satisfied. We provide a theoretical 
proof that the algorithm guarantees an effective, stable, and safe learning 
process. Furthermore, to achieve timely decision making, George transfers 
and temporally reuses learned knowledge across sequential LRA scheduling 
events. By inheriting previously learned knowledge and adapting it to the 
next decision-making process using transfer learning (TL) methods, George 
dramatically reduces its model training effort. This work is under 
submission to ACM/IEEE SC 2021.
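
As a simplified, hypothetical sketch of the projection step (not George's 
actual algorithm), the Python snippet below corrects a performance-improving 
gradient whenever its linearized constraint cost would exceed a safety 
threshold, subtracting just enough of the constraint gradient to land back in 
the safe half-space. The symbols g, b, and c are illustrative assumptions.

    import numpy as np

    def project_to_safe_direction(g, b, c=0.0):
        """Project a performance gradient g onto the half-space {x: b.x <= c}.

        g : raw policy-improvement direction (hypothetical).
        b : gradient of the linearized operational-constraint cost.
        c : slack; keeping b.x <= c keeps the update inside the safe zone.
        """
        violation = float(b @ g) - c
        if violation <= 0.0:      # already safe, keep the raw direction
            return g
        # Minimal correction: move back onto the constraint boundary.
        return g - (violation / float(b @ b)) * b

    if __name__ == "__main__":
        g = np.array([1.0, 0.5])  # direction that improves performance
        b = np.array([0.8, 0.2])  # direction that increases constraint cost
        g_safe = project_to_safe_direction(g, b)
        print("raw:", g, "projected:", g_safe,
              "constraint value:", b @ g_safe)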


Date:			Wednesday, 7 April 2021

Time:			4:00pm - 6:00pm

Zoom Meeting:		https://hkust.zoom.us/j/5767775326

Committee Members:	Prof. Bo Li (Supervisor)
  			Dr. Yangqiu Song (Chairperson)
 			Prof. Lei Chen
 			Dr. Qiong Luo


**** ALL are Welcome ****