Effective and efficient convolutional architectures for visual recognition

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Effective and efficient convolutional architectures for visual 
recognition"

By

Mr. Ningning MA


Abstract

Deep convolutional neural networks (CNNs) have shown great success in various computer 
vision tasks. However, improving the accuracy-speed trade-off remains 
challenging. In this thesis, we divide CNNs into two categories, static CNNs 
and dynamic CNNs, according to whether the network architecture is conditioned on 
the input image. Specifically, we have two goals: one is to improve the 
efficiency of static CNNs, and the other is to explore more effective 
dynamic neural architectures. For static CNNs, to improve efficiency, we 
investigate practical guidelines for efficient CNN design (e.g., ShuffleNetV2). For 
dynamic CNNs, we present three simple, efficient, and effective methods.
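As a concrete illustration of one ingredient of such efficient designs, the 
following is a minimal PyTorch-style sketch (not the thesis code; the function 
name and shapes are illustrative) of the channel shuffle operation that lets 
ShuffleNet-style blocks mix information across channel groups:

    import torch

    def channel_shuffle(x, groups):
        # Reshape channels into (groups, channels_per_group), transpose,
        # and flatten back, so information mixes across the groups.
        n, c, h, w = x.shape
        return (x.view(n, groups, c // groups, h, w)
                 .transpose(1, 2)
                 .reshape(n, c, h, w))

The operation is a pure memory permutation, so it adds essentially no FLOPs.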

First, we present WeightNet, which decouples the convolutional kernels from the 
convolutional computation. This differs from the common practice in which all 
input samples share the same kernels: there, the kernels are directly learned 
parameters, whereas in our case they are generated by an additional simple 
network made of fully-connected layers. Our approach is general: it unifies two 
distinct and highly effective methods, SENet and CondConv, in the same framework 
on the weight space. We use WeightNet, composed entirely of (grouped) 
fully-connected layers, to directly output the convolutional weights. This 
simple change has a large impact: it provides a meta-network design space, 
improves accuracy significantly, and achieves superior accuracy-FLOPs and 
accuracy-parameter trade-offs.
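To make the idea concrete, here is a minimal sketch, assuming a PyTorch-style 
API, of a per-sample kernel generator; the class name, the plain two-layer FC 
branch, and the reduction ratio are illustrative simplifications of WeightNet's 
grouped fully-connected design:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SimpleWeightNet(nn.Module):
        # A small FC branch generates a conv kernel for each input sample,
        # instead of all samples sharing one learned kernel.
        def __init__(self, in_ch, out_ch, k=3, reduction=16):
            super().__init__()
            self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
            hidden = max(in_ch // reduction, 4)
            self.fc1 = nn.Linear(in_ch, hidden)
            self.fc2 = nn.Linear(hidden, out_ch * in_ch * k * k)

        def forward(self, x):
            n, c, h, w = x.shape
            ctx = x.mean(dim=(2, 3))                     # global average pool: (N, C)
            ker = self.fc2(torch.relu(self.fc1(ctx)))    # per-sample kernels
            ker = ker.view(n * self.out_ch, self.in_ch, self.k, self.k)
            # Grouped-conv trick: fold the batch into the channel dimension so
            # each sample is convolved with its own generated kernel.
            out = F.conv2d(x.reshape(1, n * c, h, w), ker,
                           padding=self.k // 2, groups=n)
            return out.view(n, self.out_ch, h, w)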

Next, we present a new visual activation we call the funnel activation, which 
performs the non-linear transformation while simultaneously capturing 
spatial dependency. Our method extends ReLU by adding a spatial condition 
with negligible overhead to replace the hand-designed zero in ReLU, which helps 
capture complicated visual layouts with regular convolutions. Although it seems 
a minor change, it has a large impact: it yields substantial improvements on 
many visual recognition tasks and even outperforms more complicated modules 
such as deformable convolution and SENet.
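In form, the funnel activation replaces ReLU's max(x, 0) with max(x, T(x)), 
where T(x) is a cheap spatial condition. Below is a minimal sketch, assuming 
PyTorch, with T(x) implemented as a depthwise 3x3 convolution followed by batch 
normalization (the class name is illustrative):

    import torch
    import torch.nn as nn

    class FunnelReLU(nn.Module):
        # Funnel activation: y = max(x, T(x)), where T(x) is a depthwise
        # 3x3 convolution acting as the spatial condition that replaces
        # the hand-designed zero in ReLU.
        def __init__(self, channels):
            super().__init__()
            self.t = nn.Conv2d(channels, channels, 3, padding=1,
                               groups=channels, bias=False)  # depthwise
            self.bn = nn.BatchNorm2d(channels)

        def forward(self, x):
            return torch.max(x, self.bn(self.t(x)))

Because the condition is depthwise, the extra cost per layer is negligible 
compared with the regular convolutions around it.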

Third, we present a simple, effective, and general activation function we term 
ACON, which learns whether or not to activate each neuron. Interestingly, we 
find that Swish, the recently popular NAS-searched activation, can be 
interpreted as a smooth approximation of ReLU. In the same way, we smoothly 
approximate the more general Maxout family, yielding our novel ACON family, 
which remarkably improves performance and makes Swish a special case of ACON. 
We further present meta-ACON, which explicitly learns the switching factor 
between the non-linear (activated) and linear (inactivated) regimes and 
provides a new design space. By simply changing the activation function, we 
show its effectiveness on both small models and highly optimized large models 
(e.g., it improves ImageNet top-1 accuracy by 6.7% and 1.8% on MobileNet-0.25 
and ResNet-152, respectively). Moreover, ACON transfers naturally to object 
detection and semantic segmentation, showing that it is an effective 
alternative in a variety of tasks.
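For intuition, the channel-wise variant ACON-C smoothly interpolates between 
two linear functions p1*x and p2*x, with a learnable beta controlling how 
sharply it switches; setting p1 = 1 and p2 = 0 recovers Swish. A minimal 
sketch, assuming PyTorch (the class name and random initialization are 
illustrative; in meta-ACON, beta would be generated by a small network from 
the input rather than learned directly):

    import torch
    import torch.nn as nn

    class AconC(nn.Module):
        # ACON-C: a smooth approximation of max(p1*x, p2*x). Large beta
        # gives a sharp non-linear switch; beta -> 0 degrades to the
        # linear mean of the two branches (the neuron is "inactivated").
        def __init__(self, channels):
            super().__init__()
            self.p1 = nn.Parameter(torch.randn(1, channels, 1, 1))
            self.p2 = nn.Parameter(torch.randn(1, channels, 1, 1))
            self.beta = nn.Parameter(torch.ones(1, channels, 1, 1))

        def forward(self, x):
            d = (self.p1 - self.p2) * x
            return d * torch.sigmoid(self.beta * d) + self.p2 * x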


Date:			Monday, 15 March 2021

Time:			1:00pm - 3:00pm

Zoom Meeting:		https://hkust.zoom.com.cn/j/4468144429

Chairperson:		Prof. Ross MURCH (ECE)

Committee Members:	Prof. Long QUAN (Supervisor)
 			Prof. Qifeng CHEN
 			Prof. Chiew Lan TAI
 			Prof. Kai TANG (MAE)
 			Prof. Wenping WANG (HKU)


**** ALL are Welcome ****