Approximate Join Processing by Random Sampling

PhD Thesis Proposal Defence


Title: "Approximate Join Processing by Random Sampling"

by

Mr. Bin WU


Abstract:

For many of today’s data-driven analytical tasks, users often need to pose 
ad hoc complex join queries involving multiple relational tables over 
gigabytes or even terabytes of data. When the data set is large, computing 
exact answers can be time-consuming. An important observation is that 
these analytical queries often do not require exact results. Therefore, 
approximate query processing techniques have been extensively studied in 
the literature, which offer users a nice and flexible tradeoff between 
query efficiency and accuracy. While the problem is relatively well solved 
when the query is posed on a single table, current solutions for complex 
join queries involving multiple tables are still far from satisfactory. 
In this thesis proposal, we set to address this latter issue, making 
progress on both the theory and practice of approximate join processing 
through the use of random sampling.

Firstly, we study the problem in a simplified setting, that of estimating 
the number triangles in a large graph, which is self-join over the three 
copies of the same relation with two attributes. This problem has been 
extensively studied in the streaming model, which by definition requires 
at least one pass over the entire data set. Nevertheless, we observe that 
the ideas in many of these streaming algorithms can be coupled with random 
sampling to yield sublinear-time algorithms. We make these random sampling 
triangle counting algorithms more explicit, and present an experimental 
and analytical comparison of different approaches, identifying the best 
performers among a number of candidates.

Next, we study the general problem of approximate processing of complex 
join queries. We generalize our triangle sampling algorithm to arbitrary 
join queries, by performing random walks over the underlying join graph. 
We also design an optimizer that chooses the optimal plan for conducting 
the random walks without having to collect any statistics a priori. 
Extensive experiments using the TPC-H benchmark have demonstrated the 
superior performance of wander join over previous works.

Finally, we have implemented our algorithm in a number of popular database 
systems, demonstrating the practicality of the algorithm and its 
performance in full-fledged systems. This includes PostgreSQL, a widely 
used opensource database system, SparkSQL, the next-generation parallel 
data processing platform designed to scale, and PL/SQL, so that our 
algorithm can be provided as an add-on package to commercial database 
systems whose kernel code is not accessible.


Date:			Tuesday, 17 January 2017

Time:                  	9:30am - 11:30am

Venue:                  Room 5510
                         (lifts 25/26)

Committee Members:	Dr. Ke Yi (Supervisor)
  			Dr. Qiong Luo(Chairperson)
 			Prof. Lei Chen
 			Dr. Raymond Wong


**** ALL are Welcome ****