Observable and Economical Dataflow Computation in Datacenters

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Observable and Economical Dataflow Computation in Datacenters"

By

Mr. Huangshi TIAN


Abstract

With the proliferation of data emerges a myriad of dataflow frameworks. When 
they are deployed in a datacenter and productized as a service, their 
performance and cost become two primary concerns. However, performance issues 
prevail in dataflow computation. Their diagnosis is complicated by the 
heterogeneity of dataflow frameworks, which differ in underlying design, 
application domain, and computational complexity. This heterogeneity makes it 
difficult for service providers and users to locate and debug problems. A 
side effect of performance issues is higher resource cost: the datacenter 
operator cannot easily determine an allocation that guarantees stable 
performance, which leads to resource waste.

To tackle the challenges of performance and cost, the dissertation first 
characterizes dataflow computation in a large datacenter by analyzing a 
recently released workload trace. It examines the static properties of job DAGs 
and the runtime characteristics of their task execution. Statically, the DAGs 
are discovered to exhibit high artificiality when compared with random graphs. 
The dependent tasks may have significant variability in resource usage and 
duration, even for recurring tasks. The results confirm the challenge of 
performance debugging and resource allocation.

To diagnose performance issues, the dissertation enables resource observability 
in dataflow computation by proposing CrystalPerf, a new approach that learns to 
characterize the performance of dataflow computation based on code analysis. It 
requires no code instrumentation and applies to a wide variety of dataflow 
frameworks. Our key insight is that the source code of an operation contains 
learnable syntactic and semantic patterns that reveal how it uses resources. 
Our approach establishes a performance-resource model that, given a dataflow 
program, infers automatically how much time each operation has spent on each 
resource (e.g., CPU, network, disk) from past execution traces and the program 
source code, using machine learning techniques. Extensive evaluations and 
real-world case studies show that CrystalPerf can predict job performance and 
accurately detect runtime bottlenecks of DAG jobs.

To reduce resource costs, the dissertation proposes Owl, an overcommitted 
scheduler for executing dataflow computation on serverless platforms. It takes 
a dual approach to achieve high utilization without compromising performance. 
(1) For less-invoked functions, it allocates resources to the 
sandboxes with a usage-based heuristic, keeps monitoring their performance, and 
remedies any detected degradation. (2) For frequently-invoked functions, Owl 
profiles the interference patterns among collocated functions and places the 
sandboxes under the guidance of profiles. Owl further consolidates idle 
sandboxes to reduce resource waste. We prototype Owl in our production system 
and implement a representative benchmark suite to evaluate it. The results 
demonstrate that the prototype reduces VM cost by 43.80% and effectively 
mitigates latency degradation, with negligible overhead.


Date:			Tuesday, 19 July 2022

Time:			1:00pm - 3:00pm

Zoom Meeting: 
https://hkust.zoom.us/j/99656972022?pwd=SzI1R1hTa2xIR0tqTWNqTDNkQThHZz09

Chairperson:		Prof. Jidong ZHAO (CIVL)

Committee Members:	Prof. Wei WANG (Supervisor)
 			Prof. Bo LI
 			Prof. Shuai WANG
 			Prof. Jiang XU (ECE)
 			Prof. Chuan WU (HKU)


**** ALL are Welcome ****