Hadoop-based Storage System for Big Spatio-Temporal Data Analytics

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Hadoop-based Storage System for Big Spatio-Temporal Data Analytics"

By

Mr. Haoyu Tan


Abstract

During the past decade, various GPS-equipped devices have generated a 
tremendous amount of data with time and location information, which we 
refer to as big spatio-temporal data. As the size of the data is 
continuously growing, it will outgrow the capabilities of any serial 
processing techniques and it is therefore necessary to perform the data 
analytics in parallel.

There are two main paradigms for large scale data processing: parallel 
relational database management system (RDBMS) and MapReduce. The debate on 
which paradigm is superior to the other has lasted for several years, 
which led to a widely accepted view that there are advantages to both 
paradigms in different aspects. It was once believed that RDBMS can 
deliver a better performance while MapReduce can scale out more easily due 
to its emphasis on fault tolerance design. However, recent works from both 
sides demonstrate that techniques used in one paradigm can be incorporated 
into another to fix the deficiencies. In context of our research, we use 
Hadoop, an open-source implementation of MapReduce and related components, 
to perform spatio-temporal data analytics. The main consideration is that 
Hadoop provides us with low-level application programming interface (API) 
which is more flexible for implementing complex data mining algorithms 
than structured query language (SQL) supported by RDBMS.

In this thesis, we first describe the design and implementation of CloST, 
a scalable big spatio-temporal data storage system to support data 
analytics using Hadoop. The main objective of CloST is to avoid scan the 
whole dataset when a spatio-temporal range is given. To this end, we 
propose a novel data model which has special treatments on three core 
attributes including an object id, a location and a time. Based on this 
data model, CloST hierarchically partitions data using all core attributes 
which enables efficient parallel processing of spatio-temporal range 
scans. According to the data characteristics, we devise a compact storage 
structure which reduces the storage size by an order of magnitude. In 
addition, we proposes scalable bulk loading algorithms capable of 
incrementally adding new data into the system. Then we address the problem 
of parallel creation of secondary indexes in CloST. Particularly, we 
present the design of a general framework for parallel R-tree packing 
using Hadoop. This framework sequentially packs each R-tree level from 
bottom up. For lower levels that have a large number of rectangles, we 
propose a partition-based algorithm for parallel packing. We also discuss 
two spatial partitioning methods that can efficiently handle heavily 
skewed datasets.

To evaluate the performance, we conduct extensive experiments using large 
real datasets. The size of the datasets is up to 200GB and the number of 
spatial objects is up to 2 billion. The results show that CloST has fast 
data loading speed, high scalability in query processing, high data 
compression ratio, and desirable performance in building very large 
secondary spatial indexes (R-trees).


Date:			Thursday, 29 November 2012

Time:			12:30pm - 2:30pm

Venue:			Room 3588
 			Lifts 27/28

Chairman:		Prof. Zongjin Li (CIVL)

Committee Members:	Prof. Lionel Ni (Supervisor)
 			Prof. Lei Chen
 			Prof. Qiong Luo
 			Prof. Furong Gao (CBME)
                        Prof. Qing Li (Comp. Sci., CityU)


**** ALL are Welcome ****