I/O-Efficient Algorithms and Data Structures

Speaker:	Ke YI
		Department of Computer Science
		Duke University

Title:		"I/O-Efficient Algorithms and Data Structures"

Date:		Wednesday, 1 March 2006

Time:		3:30pm - 4:30pm

Venue:		Room 2404 (via lift nos. 17/18)
		HKUST

ABSTRACT:

As the massive data sets we face today far exceed memory limit,
traditional RAM algorithms do not scale due to excessive page faults.  In
recent years, I/O-efficient algorithms have been a successive and active
direction to better scalability on massive data.  In this talk, I will
discuss two topics from my thesis work on I/O-efficient algorithms and
data structures, with their applications in large spatial databases.

The union-find problem is one of the most fundamental algorithmic
problems. However, no I/O-efficient algorithm is known despite extensive
study over the last four decades.  I will present the first I/O-efficient
algorithm for the batched (off-line) union-find problem, and illustrate
how it can be used in some important tasks in geographic information
systems. Experimental results show that the new algorithm gives
order-of-magnitude improvement over previous methods on large data sets
that do not fit in memory.

The R-tree is a disk-based data structure that is commonly used in spatial
databases.  Many R-tree variants have been proposed in the past 20 years
since its invention.  However, all existing R-trees are based on
heuristics and have poor performance on bad data sets.  I will present the
priority R-tree, which is the first R-tree variant that always answers a
window query using the optimal number of I/Os.  Not only being
theoretically optimal, experiments show that the priority R-tree is also
quite competitive compared with other popular R-trees on real-life and
relatively nicely distributed data, and outperforms them significantly on
more extreme data.



*********************
Biography:

Ke YI got his B.E. in Computer Science from Tsinghua University, China, in
2001.  He is currently a Ph.D. candidate in the Department of Computer
Science at Duke University, and is expected to complete all degree
requirements by August, 2006.  During his Ph.D. study he also spent two
summers as an intern at IBM T.J. Watson Research Center and AT&T Labs
-Research, respectively.  His primary research interest is algorithms,
with a focus on I/O-efficient algorithms, but he is also interested in
many problems that arise from database systems.