The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Towards Efficient and Secure Large-Scale Systems for
Distributed Machine Learning Training"

By

Mr. Chengliang ZHANG


Abstract:

Machine learning (ML) techniques have advanced in leaps and bounds in the 
past decade. Their success critically relies on abundant computing power 
and the availability of big data: it is impractical to host ML training on 
a single machine, and a sole data source usually does not produce a 
sufficiently general model. By distributing the ML workload across multiple 
machines and utilizing data across multiple silos, we can substantially 
improve the quality of ML training. As large-scale ML training is 
increasingly deployed in production systems involving multiple entities, 
improving efficiency and ensuring the confidentiality of the participants 
become pressing needs. First, how can we efficiently train an ML model in 
a cluster in the presence of heterogeneity? Second, in the context of 
federated learning (FL), where multiple data owners collaboratively train 
a model, how can we mitigate the overhead introduced by privacy-preserving 
techniques? Lastly, in the nuanced case where many organizations that own 
data but lack ML expertise would like to pool their data and collaborate 
with those who have the expertise (the model owner) to train generalizable 
models, how can we protect the model owner's intellectual property (model 
privacy) while preserving the data privacy of the data owners?

General ML training solutions prove inadequate under the efficiency and 
privacy challenges posed by distributed ML. First, traditional distributed 
ML systems often conduct asynchronous training to mitigate the impact of 
stragglers. While this maximizes training throughput, the price paid is 
degraded training quality due to inconsistency across workers. Second, 
although techniques like Homomorphic Encryption (HE) can be conveniently 
adopted to preserve data privacy in FL, they induce prohibitively high 
computation and communication overheads. Third, there is yet to be a 
practical solution that protects the model owner's intellectual property 
without compromising the data owners' privacy.

To fill in the gaps mentioned above, we profile, analyze, and propose new 
strategies to improve training efficiency and privacy guarantees.

To improve the efficiency of distributed asynchronous training, we first 
propose a new distributed synchronization scheme, termed speculative 
synchronization. Our scheme allows workers to speculate about recent 
parameter updates from others on the fly; when necessary, a worker aborts 
its ongoing computation, pulls fresher parameters, and starts over to 
improve the quality of training. We implement our scheme and demonstrate 
that speculative synchronization achieves substantial speedups over the 
asynchronous parallel scheme with minimal communication overhead.
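The idea can be sketched in a few lines of Python. This is a toy illustration, not the thesis's actual implementation: the staleness threshold, the parameter-server interface, and the chunked computation are all illustrative assumptions.

```python
STALENESS_LIMIT = 2   # abort if this many newer updates have arrived (assumed policy)

class ParamServer:
    """Minimal parameter server: versioned parameters plus additive updates."""
    def __init__(self, params):
        self.params, self.version = params, 0
    def push(self, update):
        self.params = [p + u for p, u in zip(self.params, update)]
        self.version += 1
    def pull(self):
        return list(self.params), self.version

def speculative_step(server, compute_chunks):
    """One training iteration that restarts if the pulled parameters go stale.

    compute_chunks(params) yields partial gradients, modeling a long
    computation with natural points to speculate at.
    """
    while True:
        params, start_version = server.pull()
        grad, aborted = [0.0] * len(params), False
        for chunk in compute_chunks(params):
            grad = [g + c for g, c in zip(grad, chunk)]
            # Speculation point: peek at how far the server has moved on.
            if server.version - start_version >= STALENESS_LIMIT:
                aborted = True
                break                # discard stale work, pull fresh parameters
        if not aborted:
            server.push(grad)
            return grad
```

The worker trades a bounded amount of wasted computation for gradients that are always computed against reasonably fresh parameters.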

Second, we present BatchCrypt, a system solution for cross-silo FL that 
significantly reduces the encryption and communication overhead caused by 
HE. Instead of encrypting individual gradients with full precision, we 
encode a batch of quantized gradients into a long integer and encrypt it 
in one go. To allow gradient-wise aggregation to be performed on 
ciphertexts of the encoded batches, we develop new quantization and 
encoding schemes along with a novel gradient clipping technique. Our 
evaluations confirm that BatchCrypt can effectively reduce the computation 
and communication overhead.
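The batching idea can be illustrated with plain Python integers. This sketch is not BatchCrypt itself: the slot width, quantization scale, and offset encoding are illustrative assumptions, and the HE step is omitted; in the real system the packed integers would be encrypted, and only the additions below would run on ciphertexts.

```python
SLOT_BITS = 32          # bits reserved per gradient slot (leaves headroom for sums)
SCALE = 1 << 16         # fixed-point quantization scale
OFFSET = 1 << 24        # shift quantized values so every slot is non-negative

def quantize(grads):
    return [int(round(g * SCALE)) + OFFSET for g in grads]

def pack(qs):
    """Pack a batch of quantized gradients into one long integer."""
    packed = 0
    for q in reversed(qs):
        packed = (packed << SLOT_BITS) | q
    return packed

def unpack(packed, n):
    mask = (1 << SLOT_BITS) - 1
    return [(packed >> (i * SLOT_BITS)) & mask for i in range(n)]

def dequantize(qs, num_parties):
    # Each party added one OFFSET per slot, so subtract it num_parties times.
    return [(q - num_parties * OFFSET) / SCALE for q in qs]

# Two parties aggregate gradients with a single addition on the packed form.
g1, g2 = [0.5, -1.25, 2.0], [1.5, 0.25, -3.0]
agg = pack(quantize(g1)) + pack(quantize(g2))
result = dequantize(unpack(agg, 3), num_parties=2)   # elementwise sum of g1, g2
```

Because addition on the packed integer is slot-wise (as long as no slot overflows), one encryption and one ciphertext addition cover an entire batch of gradients instead of one each.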

Lastly, to address collaborative learning scenarios where model privacy is 
also required, we devise Citadel, a scalable system that protects the 
privacy of both the data owners and the model owner on untrusted 
infrastructure with the help of Intel SGX. Citadel performs distributed 
training across multiple training enclaves running on behalf of the data 
owners and an aggregator enclave running on behalf of the model owner. 
Citadel further establishes a strong information barrier between these 
enclaves by means of zero-sum masking and hierarchical aggregation to 
prevent data/model leakage during collaborative training. We deploy 
Citadel in the cloud to train various ML models, and show that it is 
scalable while providing strong privacy guarantees.
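Zero-sum masking itself can be sketched briefly. The following toy code is an illustrative assumption about the mechanism, not Citadel's actual protocol: each training enclave adds a random mask to its update, and because the masks are constructed to sum to zero, the aggregator recovers the exact sum without seeing any individual update.

```python
import random

MOD = 2**64  # work in a finite group so each mask perfectly hides its value

def zero_sum_masks(n_parties, dim, seed=0):
    """Generate n_parties random mask vectors whose columns sum to 0 mod MOD."""
    rng = random.Random(seed)
    masks = [[rng.randrange(MOD) for _ in range(dim)]
             for _ in range(n_parties - 1)]
    last = [(-sum(col)) % MOD for col in zip(*masks)]   # cancels the others
    return masks + [last]

updates = [[3, 7], [10, -2], [5, 5]]                    # per-owner integer updates
masks = zero_sum_masks(3, 2)
masked = [[(u + m) % MOD for u, m in zip(up, mk)]       # what the aggregator sees
          for up, mk in zip(updates, masks)]
aggregate = [sum(col) % MOD for col in zip(*masked)]    # masks cancel in the sum
```

Any single masked update is indistinguishable from random, yet the aggregate equals the true elementwise sum of the owners' updates.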


Date:			Wednesday, 31 March 2021

Time:			1:00pm - 3:00pm

Zoom Meeting: 
https://hkust.zoom.us/j/96913415039?pwd=MzV3a2Mrbk5qTS9uU05Kb3BHRVVJdz09

Chairperson:		Prof. Kun XU (MATH)

Committee Members:	Prof. Wei WANG (Supervisor)
 			Prof. Bo LI
 			Prof. Shuai WANG
 			Prof. Jiang XU (ECE)
 			Prof. Song GUO (PolyU)


**** ALL are Welcome ****