Incomplete Data Analysis in Smart City Applications

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Incomplete Data Analysis in Smart City Applications"

By

Mr. Siyuan Liu


Abstract

Incomplete data in my work is defined as the data with extremely limited 
samples or data observed, which brings big challenges to data mining and 
real applications. Such extremely limited sample data would obviously give 
us terrible bias and inaccurate results. In my work, I studied three real 
applications in smart city and proposed novel algorithms and methodology 
to tackle the incomplete data analysis problem. Application one: Given one 
over ten thousand of the whole set of vehicles in a city, how can we still 
retrieve the vehicle distribution and detect the hot spots or crowded 
areas in the city? The traditional density-based clustering methods work 
not well because of the very limited and errorable vehicle density or 
location information. Hence we need new algorithms to handle such 
incomplete data, in terms of accuracy and scalability. On the other hand, 
the vehicle traces are typical spatio-temporal data, which requires 
efficient approaches. In this paper, we have an interesting observation 
that the vehicle speed can indicate the crowdedness of a given area. In 
other words, if a given area is very crowded, then the vehicles' speed in 
this area is low; while if this area is not crowded, then the vehicles' 
speed in this area prefers high. As such the mobility of samples is 
naturally incorporated and a novel non-density-based clustering method is 
developed, called mobility-based clustering. Several key factors beyond 
the vehicle crowdedness have been identified and techniques to compensate 
these effects are proposed. We evaluate the performance of mobility-based 
clustering based on real traffic situations. Experimental results show 
that using 0.3 % of vehicles as the samples, mobility-based clustering can 
accurately identify hot spots which can hardly be obtained by the latest 
representative algorithm UMicro. Application two: Given one over one 
hundred thousand of the whole set of population in a city, how can we 
still retrieve the population distribution and detect the population 
bursts in the city? To address the difficulties of lacking real population 
data, we take the advantage of mobile phone networks, which offer enormous 
spatiotemporal communication data among persons. More importantly, we find 
the fact that we can utilize these mobile phone data to infer and 
approximate the population data. Thus, we can study the population burst 
detection problem taking advantages of the unique features hidden in the 
mobile phone data. In this work, we present PBD, a system to conduct 
Population Burst Detection. First, we propose an effective clustering 
method, correlation-based clustering, to cluster the incomplete location 
information from the mobile phone data. Then, we design an adaptive 
parameter-free detection method, R-scan, to capture the distributed 
dynamic bursts. Finally, we devise an efficient algorithm, BT-miner, to 
retrieve burst trajectories. The experimental results on real life mobile 
phone data confirm the effectiveness and efficiency of the proposed 
algorithms. Application three: We study a very interesting and practical 
problem, pattern matching in a distributed mobile environment. Pattern 
matching is a well-known problem and extensive research has been conducted 
for performing effective and efficient search. However, previous 
approaches assume that data are centrally stored, which is not the case in 
a mobile environment (e.g., mobile phone networks), where one person's 
pattern could be separately stored in a number of different stations, and 
such a local pattern is incomplete compared with the global pattern. A 
simple solution to pattern matching over a mobile environment is to 
collect all the data distributed in base stations to a data center and 
conduct pattern matching at the data center afterwards. Clearly, such a 
simple solution will raise huge amount of communication cost, which may 
cause the traffic congestion brought by the limited wireless bandwidth to 
be even worse. Therefore, a communication efficient and search effective 
solution is necessary. In this work, we propose a novel solution based on 
a well-designed Weighted Bloom Filter (WBF), called, Distributed 
Incomplete pattern matching (DI-matching), to find target patterns over a 
distributed mobile environment. Specifically, to save communication cost 
and ensure pattern matching in distributed incomplete patterns, we use WBF 
to encode a query pattern and disseminate the encoded data to each base 
station. Each base station conducts a local pattern search according to 
the received WBF. Only qualified IDs and corresponding weights in each 
base station are sent to the data center for aggregation and verification. 
Through non-trivial theoretical analysis and extensive empirical 
experiments on a real city-scale mobile networks data set, we demonstrate 
the effectiveness and efficiency of our proposed solutions.


Date:			Thursday, 18 August 2011

Time:			3:00pm – 5:00pm

Venue:			Room 3501
 			Lifts 25/26

Chairman:		Prof. Hong-Kam Lo (CIVL)

Committee Members:	Prof. Lionel Ni (Supervisor)
 			Prof. Lei Chen
 			Prof. Shing-Chi Cheung
 			Prof. Guohua Chen (CBME)
                         Prof. Xiaohua Jia (Comp. Sci., CityU)


**** ALL are Welcome ****