Towards Efficient Deep Learning Systems with Learning-Based Optimizations

PhD Thesis Proposal Defence


Title: "Towards Efficient Deep Learning Systems with Learning-Based 
Optimizations"

by

Mr. Yiding WANG


Abstract:

Deep learning has demonstrated advanced performance in various computer 
vision and natural language processing tasks over the past decade. Deep 
learning models are now fundamental building blocks for applications 
including autonomous driving, cloud video analytics, sentiment analysis, 
and natural language inference. To achieve high accuracy on demanding 
tasks, deep neural networks grow rapidly in size and computational 
complexity and require large volumes of high-fidelity data, making 
training and inference time-consuming and costly. These challenges have 
become salient and motivate practitioners to focus on building efficient 
machine learning systems. In recent years, the intersection of 
traditional computer systems and machine learning has attracted 
considerable research attention, including applying machine learning 
techniques or learned policies in system designs (i.e., machine learning 
for systems) and optimizing systems specifically for machine learning 
pipelines and workloads (i.e., systems for machine learning). Combining 
both, research on using machine learning techniques to optimize machine 
learning systems shows significant efficiency improvements by exploiting 
the inherent mechanisms of learning tasks.

This dissertation proposal presents and discusses new directions for 
optimizing the speed, accuracy, and system overhead of machine learning 
training and inference systems in different applications using 
learning-based techniques. We find that aligning system designs with 
machine learning workloads lets systems prioritize the data, neural 
network parameters, and computation that learning tasks actually need, 
thereby improving performance: for example, achieving high-quality 
edge-cloud video analytics with low bandwidth consumption by transmitting 
optimized video data that preserves the necessary information, reducing 
models' training computation by focusing on under-trained parameters, and 
adaptively assigning less model capacity to simpler natural language 
queries through real-time semantic understanding. With three case studies 
ranging from training to inference and from computer vision to natural 
language processing, we show that using learning-based techniques to 
optimize the design of machine learning systems can substantially improve 
the efficiency of machine learning applications.

First, we propose and analyze Runespoor, an edge-cloud video analytics 
system that uses super-resolution to mitigate the accuracy loss caused by 
sending compressed data over the network. Emerging deep learning-based 
video analytics tasks, e.g., object detection and semantic segmentation, 
demand computation-intensive neural networks and powerful computing 
resources on the cloud to achieve high inference accuracy. Due to latency 
requirements and limited network bandwidth, edge-cloud systems adaptively 
compress the data to strike a balance between overall analytics accuracy 
and bandwidth consumption. However, the degraded data leads to another 
issue: poor tail accuracy, i.e., extremely low accuracy on a few semantic 
classes and video frames. Modern applications like autonomous robotics 
especially value tail accuracy, but suffer when using prior edge-cloud 
systems. Runespoor extends super-resolution, an effective technique that 
learns a mapping from low-resolution frames to high-resolution frames, to 
be analytics-aware: on the server, it reconstructs high-resolution frames 
with augmented details from the compressed low-resolution data, tailored 
to the tail accuracy of video analytics tasks. Our evaluation shows that 
Runespoor improves class-wise tail accuracy by up to 300% and frame-wise 
90%/99% tail accuracy by up to 22%/54%, and greatly improves the overall 
accuracy-bandwidth trade-off.
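
For illustration, the minimal PyTorch sketch below shows the server-side 
flow described above: a super-resolution model reconstructs a 
high-resolution frame from the compressed low-resolution input before the 
analytics model runs on it. The function and model names (serve_frame, 
sr_model, analytics_model) and tensor shapes are assumptions for this 
sketch, not Runespoor's actual implementation.

    import torch

    def serve_frame(lr_frame: torch.Tensor,
                    sr_model: torch.nn.Module,
                    analytics_model: torch.nn.Module) -> torch.Tensor:
        """Serve one compressed low-resolution frame on the cloud server.

        lr_frame: low-resolution frame tensor of shape (1, 3, H, W).
        """
        with torch.no_grad():
            # Super-resolution: reconstruct a high-resolution frame with the
            # details the analytics task needs (e.g., 4x upscaling).
            hr_frame = sr_model(lr_frame)           # (1, 3, 4H, 4W)
            # Run the analytics DNN (e.g., semantic segmentation) on the
            # reconstructed frame instead of the degraded input.
            prediction = analytics_model(hr_frame)  # per-pixel class logits
        return prediction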

Next, we explore Egeria, a knowledge-guided deep learning training system 
that employs semantic knowledge from a reference model and knowledge 
distillation techniques to accelerate model training by accurately 
evaluating individual layers' training progress, safely freezing the 
converged ones, and saving their corresponding backward computation and 
communication. Training deep neural networks is time-consuming. While 
most existing efficient training solutions try to overlap or schedule 
computation and communication, Egeria goes one step further by skipping 
them through layer freezing. The key insight is that the training 
progress of internal neural network layers differs significantly, and 
front layers often become well-trained much earlier than deep layers. To 
exploit this, we introduce the notion of training plasticity to quantify 
the training progress of layers. Informed by the latest knowledge 
distillation research, we use a reference model that is generated on the 
fly with quantization techniques and runs forward operations 
asynchronously on available CPUs to minimize the overhead. Our 
experiments with popular vision and language models show that Egeria 
achieves a 19%-43% training speedup over the state of the art without 
sacrificing accuracy.
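
As a rough illustration of layer freezing, the sketch below freezes 
converged top-level modules using a simple parameter-change proxy for 
training plasticity. This proxy, the threshold, and the helper name 
(maybe_freeze_layers) are assumptions for the sketch only; Egeria's 
actual plasticity metric is computed against the quantized reference 
model described above.

    import torch

    def maybe_freeze_layers(model: torch.nn.Module, prev_params: dict,
                            threshold: float = 1e-3) -> dict:
        """Freeze modules whose plasticity proxy falls below `threshold`.

        Returns a parameter snapshot to pass into the next check, e.g.,
        when called every few hundred iterations from the training loop.
        """
        snapshot = {}
        for name, module in model.named_children():
            plist = [p.detach().flatten() for p in module.parameters()]
            if not plist:
                continue  # skip parameter-free modules (e.g., activations)
            params = torch.cat(plist)
            snapshot[name] = params.clone()
            if name in prev_params and params.numel() == prev_params[name].numel():
                # Plasticity proxy: relative parameter change since last check.
                change = ((params - prev_params[name]).norm()
                          / (prev_params[name].norm() + 1e-12))
                if change < threshold:
                    for p in module.parameters():
                        # Frozen: backward computation and gradient
                        # communication for this module are skipped.
                        p.requires_grad_(False)
        return snapshot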

Finally, we present Tabi, an inference system with a multi-level 
inference engine optimized for large language models and diverse 
workloads, which exploits the prediction confidence of neural networks 
and the Transformer's attention mechanism. Today's trend of building ever 
larger language models, while pushing the performance of natural language 
processing, adds significant latency to the inference stage. We observe 
that, due to the diminishing returns of adding model parameters, a 
smaller model can make the same prediction as a costly large language 
model for a majority of queries. Based on this observation, we design 
Tabi to serve queries with both small models and optional large ones, 
unlike the traditional one-model-for-all pattern. Tabi uses calibrated 
confidence scores to decide whether to return the accurate results of 
small models extremely fast or to re-route queries to large models. For 
re-routed queries, it uses attention-based word pruning and weighted 
ensemble techniques to offset the system overhead and accuracy loss. Tabi 
achieves a 21%-40% average latency reduction (with comparable tail 
latency) over the state of the art while meeting high accuracy targets.
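
The core routing decision can be sketched as follows. The calibration 
method shown (temperature-scaled softmax), the threshold, and the 
function names are illustrative assumptions, and Tabi's attention-based 
word pruning and weighted ensembling for re-routed queries are omitted.

    import torch
    import torch.nn.functional as F

    def serve_query(inputs, small_model, large_model,
                    temperature: float = 1.5, threshold: float = 0.9):
        """Serve a single query: answer with the small model if it is
        confident, otherwise re-route the query to the large model."""
        with torch.no_grad():
            logits = small_model(inputs)              # shape (1, num_classes)
            # Calibrated confidence via temperature-scaled softmax.
            probs = F.softmax(logits / temperature, dim=-1)
            confidence, prediction = probs.max(dim=-1)
            if confidence.item() >= threshold:
                return prediction                     # fast path: small model only
            # Low confidence: pay the cost of the large model for this query.
            return large_model(inputs).argmax(dim=-1)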


Date:			Wednesday, 17 August 2022

Time:                  	2:00pm - 4:00pm

Zoom Meeting: 
https://hkust.zoom.us/j/94067933425?pwd=VTBPcm0zVDRFZ0lOVG9iR2dreHR5Zz09

Committee Members:	Prof. Kai Chen (Supervisor)
 			Dr. Yangqiu Song (Chairperson)
 			Prof. Bo Li
 			Dr. Wei Wang


**** ALL are Welcome ****