The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Neural Architecture Design: Search Methods and Theoretical 
Understanding"

By

Mr. Han SHI


Abstract

Deep learning has emerged as a milestone in the machine learning community due to 
its remarkable performance on a variety of tasks, such as computer vision and 
natural language processing. It has been demonstrated that the architecture of a 
neural network significantly influences its performance, and it is therefore 
important to determine the network structure carefully. Typically, methods for 
neural architecture design can be classified into two categories. One category 
designs neural architectures by search methods, which aim to discover promising 
neural architectures automatically. For example, the NASNet architecture is found 
in a predefined search space using a reinforcement learning algorithm. The other 
category designs neural architectures manually based on knowledge and theoretical 
understanding. Most practical architectures, such as ResNet and the Transformer, 
are proposed based on prior knowledge. In this thesis, we provide a comprehensive 
discussion of neural architecture design from the above two perspectives.

Firstly, we introduce a neural architecture search algorithm based on Bayesian 
optimization, named BONAS. In the search phase, a GCN embedding extractor and a 
Bayesian sigmoid regressor constitute the surrogate model for Bayesian 
optimization, and candidate architectures in the search space are selected based 
on the acquisition function. In the query phase, we merge the selected candidates 
into a super-network and evaluate each architecture via a weight-sharing 
mechanism. The proposed BONAS discovers strong architectures while balancing 
exploitation and exploration.
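
For illustration only, the sketch below captures the kind of search loop described 
above, under heavy simplifications: a fixed random projection stands in for the 
GCN embedding extractor, Bayesian linear regression (without the sigmoid link) 
stands in for the Bayesian sigmoid regressor, a synthetic oracle stands in for 
weight-sharing evaluation, and all names are hypothetical rather than taken from 
the BONAS code.

    import numpy as np

    rng = np.random.default_rng(0)
    D, H = 32, 16                                   # encoding and embedding sizes (toy values)
    W_embed = rng.standard_normal((D, H)) / np.sqrt(D)

    def embed(archs):
        # Placeholder for the GCN embedding extractor.
        return np.tanh(archs @ W_embed)

    def fit_surrogate(Z, y, alpha=1.0, noise=0.1):
        # Bayesian linear regression on embeddings (the sigmoid link is omitted here).
        cov = np.linalg.inv(Z.T @ Z / noise + alpha * np.eye(Z.shape[1]))
        mean = cov @ Z.T @ y / noise
        return mean, cov

    def ucb(Z, mean, cov, kappa=2.0):
        # Upper-confidence-bound acquisition: trades off exploitation and exploration.
        mu = Z @ mean
        sigma = np.sqrt(np.einsum('ij,jk,ik->i', Z, cov, Z))
        return mu + kappa * sigma

    # Toy search space: binary encodings stand in for cell architectures, and a
    # synthetic oracle stands in for weight-sharing evaluation of the super-network.
    pool = rng.integers(0, 2, size=(500, D)).astype(float)
    true_w = rng.standard_normal(D)
    evaluate = lambda a: 1.0 / (1.0 + np.exp(-(a @ true_w) / np.sqrt(D)))

    idx = rng.choice(len(pool), size=20, replace=False)
    X, y = pool[idx], evaluate(pool[idx])
    for step in range(5):
        mean, cov = fit_surrogate(embed(X), y)
        batch = np.argsort(-ucb(embed(pool), mean, cov))[:10]   # most promising candidates
        X = np.vstack([X, pool[batch]])                         # query them (via weight sharing in BONAS)
        y = np.concatenate([y, evaluate(pool[batch])])
    print("best accuracy found:", y.max())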

Secondly, we focus on the self-attention module in the well-known Transformer and 
propose a differentiable architecture search method to find important attention 
patterns. In contrast to prior works, we find that the diagonal elements of the 
attention map can be dropped without harming performance. To understand this 
observation, we provide a theoretical proof from the perspective of universal 
approximation. Furthermore, we obtain a series of attention masks for efficient 
architecture design based on the proposed search method.
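
As a toy illustration (not the thesis code or the differentiable search itself), 
the sketch below shows what dropping the diagonal of a single-head attention map 
amounts to; all names and shapes are illustrative.

    import numpy as np

    def softmax(x, axis=-1):
        x = x - x.max(axis=axis, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, Wq, Wk, Wv, drop_diagonal=True):
        # Single-head self-attention in which each token can optionally be barred
        # from attending to itself (the diagonal of the attention map is masked).
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        if drop_diagonal:
            np.fill_diagonal(scores, -np.inf)    # diagonal entries get zero attention weight
        return softmax(scores, axis=-1) @ V

    rng = np.random.default_rng(0)
    n, d = 6, 8                                  # toy sequence length and model dimension
    X = rng.standard_normal((n, d))
    Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
    print(self_attention(X, Wq, Wk, Wv).shape)   # (6, 8)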

Thirdly, we attempt to understand the feed-forward module in the Transformer 
within a unified framework. Specifically, we introduce the concept of memory 
tokens and establish the relationship between the feed-forward and self-attention 
modules. Moreover, we propose a novel architecture named uni-attention, which 
contains all four types of attention connections in our framework. Uni-attention 
achieves better performance than previous baselines given the same number of 
memory tokens.
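
The toy sketch below (illustrative only, with biases omitted) spells out one 
reading of this relationship: a two-layer ReLU feed-forward block computes an 
attention-style lookup over learned "memory" keys and values, with the ReLU in 
place of the softmax; the function names are hypothetical.

    import numpy as np

    def feed_forward(X, W1, W2):
        # Standard Transformer feed-forward block (biases omitted): ReLU(X W1) W2.
        return np.maximum(X @ W1, 0.0) @ W2

    def memory_attention(X, K_mem, V_mem):
        # The same computation read as attention over persistent "memory" tokens:
        # the inputs are queries, the learned rows of K_mem/V_mem are keys/values,
        # and a ReLU takes the place of the softmax over attention scores.
        scores = X @ K_mem.T                     # (num_tokens, num_memory_tokens)
        return np.maximum(scores, 0.0) @ V_mem

    rng = np.random.default_rng(0)
    n, d, m = 4, 8, 32                           # tokens, model dim, memory slots (= FFN width)
    X = rng.standard_normal((n, d))
    W1, W2 = rng.standard_normal((d, m)), rng.standard_normal((m, d))
    assert np.allclose(feed_forward(X, W1, W2), memory_attention(X, W1.T, W2))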

Finally, we investigate the over-smoothing phenomenon in the whole Transformer 
architecture. We provide a theoretical analysis by building a connection between 
self-attention and the graph domain. Specifically, we find that layer 
normalization plays an important role in the over-smoothing problem and verify 
this empirically. To alleviate this issue, we propose hierarchical fusion 
architectures so that the outputs can be more diverse.
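
As a rough illustration (not the analysis in the thesis), the sketch below 
measures over-smoothing as the average pairwise cosine similarity of token 
representations under repeated attention-style averaging, and uses a plain 
average of all layer outputs as a stand-in for the proposed hierarchical fusion; 
all names and values are illustrative.

    import numpy as np

    def token_similarity(H):
        # Average pairwise cosine similarity of token representations; values near 1
        # indicate over-smoothing (all tokens collapse to nearly the same vector).
        Hn = H / np.linalg.norm(H, axis=-1, keepdims=True)
        n = len(H)
        sim = Hn @ Hn.T
        return (sim.sum() - n) / (n * (n - 1))

    def hierarchical_fusion(layer_outputs):
        # A plain average of all layer outputs (a simplification of the proposed
        # hierarchical fusion architectures): early, less-smoothed layers keep the
        # fused representation more diverse.
        return sum(layer_outputs) / len(layer_outputs)

    rng = np.random.default_rng(0)
    H = rng.standard_normal((8, 16))             # 8 tokens, 16-dimensional features
    logits = rng.standard_normal((8, 8))
    A = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)   # a fixed attention map

    outputs = [H]
    for _ in range(6):                           # repeated attention-style averaging
        outputs.append(A @ outputs[-1])

    print([round(token_similarity(h), 3) for h in outputs])          # rises toward 1
    print(round(token_similarity(hierarchical_fusion(outputs)), 3))  # typically lower than the last layer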


Date:			Friday, 5 August 2022

Time:			10:00am - 12:00noon

Zoom Meeting: 		https://hkust.zoom.us/j/5599077828

Chairperson:		Prof. Toyotaka ISHIBASHI (LIFS)

Committee Members:	Prof. James KWOK (Supervisor)
 			Prof. Minhao CHENG
 			Prof. Yangqiu SONG
 			Prof. Yuan YAO (MATH)
 			Prof. Irwin KING (CUHK)


**** ALL are Welcome ****