Classic and New Data Structure Problems in External Memory

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Classic and New Data Structure Problems in External Memory"

By

Mr. Zhewei Wei


Abstract

The demand of efficient data structures for query processing on massive 
data sets has grown tremendously in the past decades. Traditionally, data 
structures are designed and analyzed in the RAM model, where each memory 
cell can be accessed with unit cost. This assumption, however, is 
unrealistic for modeling modern memory hierarchies which consist of many 
levels of memories and caches with different sizes and access costs. As a 
consequence, a number of more elaborate models were introduced. Among them 
the most successful ones are the I/O model and the cache-oblivious model. 
In recent years, designing data structures that are I/O-efficient or 
cache-oblivious has become an active direction in both the theory and 
database communities.

This thesis starts by considering the dictionary problem, one of the most 
basic data structure problems, in which we want to store and access a set 
of (key, data) pairs. Hash tables are the most efficient and the most 
fundamental data structure for implementing a dictionary. In this thesis 
we study hash tables in both the I/O model and the cache-oblivious model. 
We first show an inherent query-insertion tradeoff of hashing in the I/O 
model, which implies that the buffering technique is essentially useless 
for hash tables. In the cache-oblivious model, we build a hash table that 
achieves the same search cost as its cache-aware version does, for all 
block sizes.

The second problem studied is another fundamental data structure problem, 
priority queues. The priority queue problem is well understood in the 
comparison based I/O model, where its complexity is known to be the same 
as sorting. In this thesis, we establish their equivalence in the I/O 
model without any restrictions, by providing a reduction from priority 
queues to sorting. Note that the other direction of the reduction is 
trivial.

The third problem studied in this thesis is the summary query problem, 
which is a natural generalization of the range searching problem. Our goal 
is to design data structures that allow for extracting a statistical 
summary of all the records in the query range.  The summaries we support 
include frequent items, quantiles, various sketches, and wavelets, all of 
which are of central importance in massive data analysis.


Date:			Thursday, 2 February 2012

Time:			1:30pm – 3:30pm

Venue:			Room 3402
 			Lifts 17/18

Chairman:		Prof. Anaimalai MUTHUKRISHNAN (MARK)

Committee Members:	Prof. Ke Yi (Supervisor)
 			Prof. Siu-Wing Cheng
 			Prof. Mordecai Golin
 			Prof. Xiangtong Qi (IELM)
                      	Prof. Yitong Yin (Comp. Sci. & Tech., Nanjing Univ.)


**** ALL are Welcome ****