Towards Efficient Transports for Datacenter Networking with High Environmental Variations

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Towards Efficient Transports for Datacenter Networking with High 
Environmental Variations"

By

Mr. Junxue ZHANG


Abstract

In real-world datacenter networks, environmental variations are high.
For instance, the base RTT, which is assumed to be stable, can vary by up
to 2.68X due to the varying processing delay introduced by network
components such as the networking stack, middleboxes, and hypervisors.
Beyond RTT variations, datacenters also exhibit other environmental
variations, e.g., in traffic pattern, topology, and failures, which pose
challenges for transport design in datacenter networks.

At the algorithm level, the high environmental variations make it
difficult for heuristic ECN-based transports to deliver optimal
performance. One concrete example is that RTT variations make it
difficult for datacenter operators to derive an ECN marking threshold
that simultaneously delivers high throughput, low latency, and good burst
tolerance. We further find that adaptive neural network (NN) driven
transports can learn and adapt to the varying environment, which shows
their potential to succeed in datacenter networking with high
environmental variations. At the deployment level, however, current
NN-based transports fail to deliver optimal performance, suffering from
either performance loss or large overhead.

This thesis describes our research efforts in designing efficient 
transports for datacenter networking with high environmental variations. 
First, to solve the problem of degraded performance with high RTT 
variations, we propose ECN#, a new heuristic ECN-based transport. ECN#
extends the current ECN marking mechanism to consider both instantaneous
and persistent congestion. Our evaluations show that ECN# can effectively
reduce latency without hurting throughput. For example, compared to the 
current practice, ECN# achieves up to 23.4% (31.2%) lower average (99th 
percentile) flow completion time (FCT) for short flows while delivering 
similar FCT for large flows under production workloads. Second, to make 
adaptive NN-based transports available for datacenter networking, we 
propose LiteFlow. LiteFlow is a hybrid framework for deploying
high-performance adaptive NNs in the kernel datapath by decoupling the
control path of adaptive NNs into a kernel-space fast path for efficient
model inference and a userspace slow path for effective model tuning. We
evaluate LiteFlow with two real-world NN-based congestion control (CC)
schemes. Experimental results show that, in terms of flow goodput,
LiteFlow with these NNs can outperform userspace-deployed NNs by up to
44.4% while incurring no more overhead than kernel-space CC algorithms
such as BBR and CUBIC.


Date:			Friday, 26 August 2022

Time:			4:00pm - 6:00pm

Zoom Meeting:
https://hkust.zoom.us/j/94626581311?pwd=bnJXeWZmb1Q1L2ozTDMrTHNpTzNadz09

Chairperson:		Prof. Kevin CHEN (ECE)

Committee Members:	Prof. Kai CHEN (Supervisor)
 			Prof. Gary CHAN
 			Prof. Ke YI
 			Prof. Xuanyu CAO (ECE)
 			Prof. Hong XU (CUHK)


**** ALL are Welcome ****