Big Data Analytics with Novel Top-k Query Processing and Classification

PhD Thesis Proposal Defence


Title: "Big Data Analytics with Novel Top-k Query Processing and 
Classification"

by

Mr. Peng PENG


Abstract:

In the era of big data, the dramatic growth in both the number of records
and the number of attributes has made decision making increasingly
difficult. Traditionally, top-k query processing has addressed the problem
of multi-criteria decision making by ranking records with a utility
function that encodes the users' requirements. However, when the utility
function is unknown, a traditional top-k query cannot capture those
requirements. Recently, researchers have proposed several novel top-k
queries, such as k-representative skyline queries and k-regret queries,
which are regarded as better solutions when the utility function is
unknown.
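
As an illustration of the query types mentioned above, the following
Python sketch contrasts a traditional top-k query under a known utility
function with a skyline computation and a regret ratio. It is only a
minimal sketch and not the method of the thesis: it assumes linear utility
functions with non-negative weights, attributes where larger values are
better, and the commonly used definition of the regret ratio. All function
names and the sample data are hypothetical.

```python
import heapq


def utility(point, weights):
    """Linear utility (an assumption): weighted sum of the attributes."""
    return sum(w * x for w, x in zip(weights, point))


def top_k(points, weights, k):
    """Traditional top-k query: needs the utility function as input."""
    return heapq.nlargest(k, points, key=lambda p: utility(p, weights))


def dominates(a, b):
    """a dominates b if a is no worse everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def skyline(points):
    """Points not dominated by any other point; needs no utility function."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def regret_ratio(points, subset, weights):
    """Relative utility lost by seeing only `subset` instead of `points`.

    Assumes the best utility over `points` is positive.
    """
    best_all = max(utility(p, weights) for p in points)
    best_sub = max(utility(p, weights) for p in subset)
    return (best_all - best_sub) / best_all


# Hypothetical two-attribute records, e.g. normalised quality and value.
points = [(0.9, 0.2), (0.6, 0.6), (0.2, 0.9), (0.5, 0.5), (0.1, 0.1)]
```

With these definitions, `top_k(points, (0.7, 0.3), 2)` answers a query for
one particular user, whereas `skyline(points)` and a small subset with a low
`regret_ratio` over many possible weight vectors aim to serve users whose
utility functions are not known in advance.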

Nevertheless, this is far from the end of the story. Because of their
computational complexity, most of these novel top-k queries (which do not
take utility functions as inputs) cannot be applied directly to
large-scale data. Specifically, the algorithms for answering them cannot
be easily modified to run on a parallel and distributed platform. Another
problem is that most existing top-k queries ignore the users'
requirements and information when the utility function is unknown. In
general, even if a user cannot provide an exact utility function, it is
possible to obtain partial information from him/her, which can be used as
query input to improve the quality of the answers. In the following, I
propose two directions for addressing the scalability and personalization
issues. On the one hand, traditional top-k query processing techniques
can be extended to the large-scale data scenario. On the other hand, I
could design a new type of top-k query that can be answered natively on a
distributed computing platform and that incorporates users' information
into its answers. My thesis proposal focuses mainly on solutions along
these two directions.
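
To make the scalability direction concrete, the sketch below shows why a
traditional top-k query with a known utility function is comparatively
easy to distribute: each partition can report its local top-k, and the
global top-k is found among those candidates. This is a single-process
illustration under assumed linear utilities, not the approach of the
thesis; the partitioning and all names are hypothetical, and the novel
queries discussed above do not decompose this simply.

```python
import heapq


def utility(point, weights):
    """Linear utility (an assumption): weighted sum of the attributes."""
    return sum(w * x for w, x in zip(weights, point))


def local_top_k(partition, weights, k):
    """Step run independently on each partition (e.g. one per worker)."""
    return heapq.nlargest(k, partition, key=lambda p: utility(p, weights))


def merged_top_k(partitions, weights, k):
    """Merge step: the global top-k lies within the union of local top-k."""
    candidates = [p for part in partitions
                  for p in local_top_k(part, weights, k)]
    return heapq.nlargest(k, candidates, key=lambda p: utility(p, weights))


# Hypothetical data split across three partitions.
partitions = [
    [(0.9, 0.2), (0.1, 0.1)],
    [(0.6, 0.6), (0.5, 0.5)],
    [(0.2, 0.9)],
]
```

Each partition sends at most k records to the merge step, so the
communication cost depends on k and the number of partitions rather than
on the size of the dataset.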

Lastly, I include my research results on an application of top-k query
processing. In particular, I extend the idea of top-k query processing to
sample a training dataset of size k for classification, one of the most
fundamental problems in machine learning and data mining, which can also
be studied in a big data environment. When constructing a training
dataset for classification, a good sampling strategy is crucial to the
quality of the training dataset. Therefore, a new type of top-k query can
be applied here to return k representative data points from the dataset.


Date:                   Wednesday, 6 May 2015

Time:                   2:00pm - 4:00pm

Venue:                  Room 3494
                        lifts 25/26

Committee Members:      Dr. Raymond Wong (Supervisor)
                        Dr. Huamin Qu (Chairperson)
                        Dr. Lei Chen
                        Dr. Qiong Luo


**** ALL are Welcome ****