Large Scale Optimization Methods for Machine Learning

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Large Scale Optimization Methods for Machine Learning"

By

Mr. Shuai ZHENG


Abstract

Dealing with large-scale datasets and increasingly complex models has been a
major challenge for optimization methods in machine learning. The stochastic
gradient descent (SGD) method is widely viewed as an ideal approach for
large-scale machine learning problems, where the conventional batch gradient
method typically falters. Despite its flexibility and scalability, the
stochastic gradient has high variance, which impedes training. In this thesis,
we introduce new optimization methods that address fundamental challenges such
as scalability and computational and communication efficiency in large-scale
machine learning tasks. The first part of this thesis introduces methods for
solving empirical risk minimization (ERM) problems based on the principle of
variance reduction. We first propose a fast and scalable stochastic ADMM
method for solving ERM problems with complex nonsmooth regularizers such as
the graph lasso and group lasso. Using similar principles, we also introduce a
stochastic continuation method for convex problems in which both the loss and
the regularizer are nonsmooth. While existing approaches rely crucially on the
assumption that the dataset is finite, we develop two SGD-like algorithms for
finite sums with infinite data (data augmentation). The proposed algorithms
outperform existing methods in terms of both iteration complexity and storage.
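As a rough illustration of the variance-reduction principle behind this part
(and not the stochastic ADMM or continuation methods themselves), a minimal
SVRG-style update in Python might look as follows; grad_i, eta, epochs, and m
are placeholder names used only for this sketch.

import numpy as np

def svrg(grad_i, w0, n, eta=0.1, epochs=10, m=100):
    """Minimal SVRG-style sketch: each stochastic gradient is corrected by a
    full gradient computed at a snapshot point, which reduces its variance.
    grad_i(w, i) returns the gradient of the i-th loss term at w."""
    w = w0.copy()
    for _ in range(epochs):
        w_snap = w.copy()
        # Full gradient at the snapshot (expensive, but done once per epoch).
        full_grad = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)
        for _ in range(m):
            i = np.random.randint(n)
            # Variance-reduced estimate: unbiased, with variance shrinking
            # as w approaches the snapshot w_snap.
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= eta * g
    return w

Because the full gradient is recomputed only once per epoch, the per-iteration
cost stays close to that of plain SGD while the gradient estimate becomes less
noisy as the iterates converge.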

The second part of this thesis investigates adaptive gradient methods for
solving optimization problems in deep learning. Inspired by recent advances in
adaptive gradient methods for training deep neural networks, we present a fast
and powerful optimization algorithm called the follow-the-moving-leader (FTML)
method. The new algorithm significantly outperforms existing approaches,
advancing the state of the art. As coordinate-wise adaptive methods, including
FTML, can generalize worse than SGD in some cases, we propose blockwise
adaptivity, which balances adaptivity and generalization. Both theoretically
and empirically, it improves convergence and generalization over its
coordinate-wise counterpart.
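As a hedged sketch of the blockwise idea (not the exact update rule proposed
in the thesis), the following Python snippet shares one adaptive scaling
factor per parameter block, e.g. per layer, instead of one per coordinate as
in Adam-style methods; the names and hyperparameters are illustrative.

import numpy as np

def blockwise_adaptive_step(params, grads, state, eta=1e-3, beta2=0.999,
                            eps=1e-8):
    """Sketch of a blockwise adaptive update: each parameter block (e.g. one
    layer's weight matrix) keeps a single second-moment estimate, so all
    coordinates in the block are rescaled by the same factor."""
    for name, g in grads.items():
        v = state.get(name, 0.0)
        # Second-moment estimate aggregated over the whole block.
        v = beta2 * v + (1 - beta2) * np.mean(g ** 2)
        state[name] = v
        params[name] -= eta * g / (np.sqrt(v) + eps)
    return params, state

Sharing the scaling within a block keeps much of the adaptivity to differing
gradient magnitudes across layers while behaving more like SGD within each
block, which is the trade-off the abstract refers to.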

The last part of this thesis studies two critical aspects of distributed
optimization methods -- asynchronicity and communication efficiency. We
develop a distributed asynchronous gradient-based method with a fast linear
convergence rate for strongly convex ERM problems, improving upon existing
distributed machine learning algorithms. On the other hand, the scalability of
large-scale distributed training of deep neural networks is often limited by
communication overhead. Motivated by recent advances in optimization with
compressed gradients, we propose a communication-efficient distributed
blockwise momentum SGD with error feedback. The proposed method provably
converges to a stationary point at the same asymptotic rate as distributed
synchronous momentum SGD, while achieving a nearly 32x reduction in
communication cost.
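As a hedged sketch of error feedback with a compressed (sign-based) gradient,
and not the exact blockwise momentum method proposed in the thesis, one local
step on a worker might look like the following; the scaling choice and
hyperparameters are illustrative.

import numpy as np

def ef_signsgd_step(w, grad, memory, eta=0.01):
    """Sketch of error feedback with sign compression: only the sign (plus a
    single scale) of the error-corrected update would be transmitted, and the
    compression residual is kept locally and added back at the next step."""
    corrected = eta * grad + memory          # add back past compression error
    scale = np.mean(np.abs(corrected))       # one scalar magnitude per message
    compressed = scale * np.sign(corrected)  # what would actually be sent
    memory = corrected - compressed          # residual kept for the next step
    w -= compressed
    return w, memory

Since each coordinate's sign can be sent with 1 bit instead of a 32-bit float,
sign-style compression is where a roughly 32x communication saving comes from;
the locally stored residual is the error-feedback mechanism that lets such
aggressively compressed methods retain convergence guarantees.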


Date:			Monday, 8 July 2019

Time:			10:00am - 12:00noon

Venue:			Room 5501
 			Lifts 25/26

Chairman:		Prof. Howard C LUONG (ECE)

Committee Members:	Prof. Tin-Yau Kwok (Supervisor)
 			Prof. Brian Mak
 			Prof. Raymond Wong
 			Prof. Tong Zhang (MATH)
 			Prof. Zhihua Zhang (Beijing University)


**** ALL are Welcome ****