Approximate Join Processing by Random Sampling

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Approximate Join Processing by Random Sampling"

By

Mr. Bin WU


Abstract

For many of today’s data-driven analytical tasks, users often need to pose ad 
hoc complex join queries involving multiple relational tables over gigabytes or 
even terabytes of data. When the data set is large, computing exact answers can 
be time-consuming. An important observation is that these analytical queries 
often do not require exact results. Therefore, approximate query processing 
techniques have been extensively studied in the literature, which offer users a 
nice and flexible tradeoff between query efficiency and accuracy. While the 
problem is relatively well solved when the query is posed on a single table, 
current solutions for complex join queries involving multiple tables are still 
far from satisfactory. In this thesis proposal, we set to address this latter 
issue, making progress on both the theory and practice of approximate join 
processing through the use of random sampling.

Firstly, we study the problem in a simplified setting, that of estimating the 
number of triangles in a large graph, which is self-join over the three copies 
of the same relation with two attributes. This problem has been extensively 
studied in the streaming model, which by definition requires at least one pass 
over the entire data set. Nevertheless, we observe that the ideas in many of 
these streaming algorithms can be coupled with random sampling to yield 
sublinear-time algorithms. We make these random sampling triangle counting 
algorithms more explicit, and present an experimental and analytical comparison 
of different approaches, identifying the best performers among a number of 
candidates.

Next, we study the general problem of approximate processing of complex join 
queries. We generalize our triangle sampling algorithm to arbitrary join 
queries, by performing random walks over the underlying join graph. We also 
design an optimizer that chooses the optimal plan for conducting the random 
walks without having to collect any statistics a priori. Extensive experiments 
using the TPC-H benchmark have demonstrated the superior performance of wander 
join over previous works.

Finally, we have implemented our algorithm in a number of popular database 
systems, demonstrating the practicality of the algorithm and its performance in 
full-fledged systems. This includes PostgreSQL, a widely used opensource 
database system, SparkSQL, the next-generation parallel data processing 
platform designed to scale, and PL/SQL, so that our algorithm can be provided 
as an add-on package to commercial database systems whose kernel code is not 
accessible.


Date:			Wednesday, 31 May 2017

Time:			10:00am - 12:00noon

Venue:			Room 4475
 			Lifts 25/26

Chairman:		Prof. Jianan Qu (ECE)

Committee Members:	Prof. Ke Yi (Supervisor)
 			Prof. Lei Chen
 			Prof. Qiong Luo
 			Prof. Jiheng Zhang (IELM)
 			Prof. Xuemin Lin (Comp. Sci. & Engg., U of NSW)


**** ALL are Welcome ****