Towards Communication-Efficient Distributed Training Systems

PhD Thesis Proposal Defence


Title: "Towards Communication-Efficient Distributed Training Systems"

by

Mr. Xinchen WAN


Abstract:

As scaling laws continue to hold, distributed training has become the standard 
methodology for coping with the exponential growth of model sizes and training 
data. Following this trend, distributed training systems have been developed to 
handle the complexity and scale of such training and to harness the 
computational power of multiple devices. However, communication remains one of 
the major challenges in these systems. The communication issues range from the 
overheads of gradient aggregation and embedding synchronization during the 
training stage to the intricate scheduling across heterogeneous hardware during 
other stages.

This dissertation delineates my research efforts in building 
communication-efficient distributed training systems through multi-level 
optimizations of their communication stack.

At the application level, we first design DGS, a communication-efficient graph 
sampling framework for distributed GNN training. Its key idea is to reduce 
network communication cost by sampling neighborhood information according to 
the locality of neighbor nodes in the cluster, and by sampling data at both the 
node and feature levels. As a result, DGS strikes a balance between 
communication efficiency and model accuracy, and integrates seamlessly with 
distributed GNN training systems.
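
For illustration, the following minimal Python sketch shows one way 
locality-aware, two-level (node- and feature-level) sampling could look. The 
function and variable names are hypothetical assumptions and do not reflect 
DGS's actual interface.

    import random

    def sample_neighbors(node, adj, part_of, local_part,
                         fanout=10, num_feat_dims=128, feat_keep=64):
        """Locality-aware, two-level sampling (illustrative only)."""
        # Node level: prefer neighbors whose features already reside on the
        # local partition, so fetching them needs no network transfer.
        local = [v for v in adj[node] if part_of[v] == local_part]
        remote = [v for v in adj[node] if part_of[v] != local_part]
        picked = random.sample(local, min(fanout, len(local)))
        if len(picked) < fanout:
            picked += random.sample(remote, min(fanout - len(picked), len(remote)))
        # Feature level: for remote neighbors, request only a subset of
        # feature dimensions to further reduce network traffic.
        feat_cols = sorted(random.sample(range(num_feat_dims), feat_keep))
        return picked, feat_cols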

We next propose G3, a scalable and efficient system for full-graph GNN 
training. G3 incorporates GNN hybrid parallelism to scale out full-graph 
training with meticulous peer-to-peer intermediate data sharing. It further 
accelerates training by balancing workloads among workers through 
locality-aware iterative partitioning and by overlapping communication with 
computation through a multi-level pipeline scheduling algorithm. Although G3 is 
initially tailored for GNN training, we believe its fundamental principle of 
peer-to-peer data sharing in hybrid parallelism can be generalized to other 
training tasks.
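
As a rough illustration of overlapping communication with computation, the 
following sketch prefetches the next pipeline chunk's remote data on a 
background thread while the current chunk is being computed. It conveys only 
the pipelining idea under assumed fetch_remote/compute_local callables and is 
not G3's actual scheduling algorithm.

    from concurrent.futures import ThreadPoolExecutor

    def pipelined_layer(chunks, fetch_remote, compute_local):
        """Prefetch chunk i+1's remote data while computing on chunk i."""
        outputs = []
        with ThreadPoolExecutor(max_workers=1) as io:
            pending = io.submit(fetch_remote, chunks[0])
            for i, chunk in enumerate(chunks):
                remote = pending.result()  # wait for chunk i's remote data
                if i + 1 < len(chunks):
                    pending = io.submit(fetch_remote, chunks[i + 1])  # prefetch
                outputs.append(compute_local(chunk, remote))  # overlaps with prefetch
        return outputs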

At the communication-library level, we present Leo, a generic and efficient 
communication library for distributed training systems. Leo offers 1) a 
communication path abstraction to describe the diverse distributed services 
employed in these systems, with predictable communication performance across 
edge accelerators; 2) unified APIs and wrappers that simplify the programming 
experience with automatic communication configuration; and 3) a built-in 
multi-path communication optimization strategy to enhance communication 
efficiency. We believe Leo can serve as a stepping stone for the development of 
hardware-accelerated distributed services in distributed training systems.
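
To convey the flavor of a path abstraction combined with a simple multi-path 
split, here is a hypothetical sketch that divides a payload across paths in 
proportion to their bandwidth; the types and functions are illustrative 
assumptions, not Leo's API.

    from dataclasses import dataclass

    @dataclass
    class Path:
        name: str              # e.g. "rdma", "pcie", or "tcp"
        bandwidth_gbps: float  # nominal bandwidth of this path

    def split_by_bandwidth(payload_bytes, paths):
        """Split a payload across paths in proportion to their bandwidth."""
        total = sum(p.bandwidth_gbps for p in paths)
        plan = [(p.name, int(payload_bytes * p.bandwidth_gbps / total))
                for p in paths]
        # Give any rounding remainder to the last path so sizes sum up exactly.
        name, size = plan[-1]
        plan[-1] = (name, size + payload_bytes - sum(s for _, s in plan))
        return plan

    # Example: split 1 GB between a 100 Gbps RDMA path and a 25 Gbps TCP path.
    print(split_by_bandwidth(10**9, [Path("rdma", 100.0), Path("tcp", 25.0)]))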


Date:                   Monday, 29 April 2024

Time:                   4:30pm - 6:00pm

Venue:                  Room 5506
                        Lifts 25/26

Committee Members:      Prof. Kai Chen (Supervisor)
                        Dr. Binhang Yuan (Chairperson)
                        Dr. Yangqiu Song
                        Dr. Weiwa Wang