Join Algorithms: From External Memory to the BSP

PhD Thesis Proposal Defence


Title: "Join Algorithms: From External Memory to the BSP"

by

Miss Xiao HU


Abstract:

Database systems have been traditionally disk-based, which had motivated the 
extensive study on external memory (EM) algorithms. However, as RAMs continue 
to get larger and cheaper, modern distributed database systems are increasingly 
adopting a main memory based, shared-nothing architecture, exemplified by 
systems like Spark and Flink. These systems can be abstracted by the bulk 
synchronous parallel (BSP) model (with variants like the MPC model and the 
MapReduce model), and there has been a strong revived interest in designing BSP 
algorithms for handling large amounts of data. With hard disks starting to fade 
away from the picture, EM algorithms may now seem less relevant. However, we 
observe that the recently developed join algorithms under the BSP model have a 
high degree of resemblance with their counterparts in the EM model. In this 
proposal, we study the relationships between the EM and BSP model, and present 
a general theoretical framework for converting EM algorithms to the BSP.

More specifically, the current state of art for BSP algorithm design is still 
akin to that of PRAM (a fundamental mode in parallel computing but seldom 
directly used in parallel algorithm), i.e., one has to specify, in each round, 
what each BSP processor should compute and to which other processors messages 
should be sent. Instead, work-depth models have enjoyed much more popularity, 
as they relieve the algorithm designers and programmers from worrying about how 
various tasks should be assigned to each of the processors. Meanwhile, they 
also make it easy to study the fundamental parallel complexity of the 
algorithm, namely work and depth, which are irrelevant to the number of 
processors available. Seeing the impact of the work-depth models on parallel 
algorithm design, we propose an EM work-depth model, by incorporating elements 
from the EM model into an internal memory work-depth model (we choose the 
multi-thread model). We show that algorithms designed in this model can be 
optimally simulated by the BSP if possible at all, and illustrate how it can be 
used to more easily design BSP algorithms by parallelizing the corresponding EM 
algorithms. This simulation result is quite friendly to many problems, in 
particular those based on divide-and-conquer. The primary target problem we 
have investigated is the join problem, which is the most central operation in 
relational databases. By parallelizing the existing EM algorithms, we are able 
to obtain new, better, and simpler join algorithms in the BSP. This means that 
EM algorithms, which were traditionally designed to optimize disk I/O, can also 
be used in today's main memory only, shared-nothing systems.


Date:			Monday, 23 April 2018

Time:                  	4:00pm - 6:00pm

Venue:                  Room 2463
                         (lifts 25/26)

Committee Members:	Dr. Ke Yi (Supervisor)
  			Prof. Mordecai Golin (Chairperson)
 			Dr. Sunil Arya
 			Prof. Siu-Wing Cheng


**** ALL are Welcome ****