A Novel Scalable Join Processor over Large RDF Graphs with Linkage Information Aware

MPhil Thesis Defence


Title: "A Novel Scalable Join Processor over Large RDF Graphs with Linkage Information 
Aware"

By

Mr. Yincheng Lin


Abstract

RDF (Resource Description Framework), developed by the W3C, is a data 
description format for the semantic web. With the development of the semantic 
web, RDF data integrated from many sources keeps growing. Because of its large 
volume and schema-free nature, efficient processing remains a major challenge 
in RDF data management. Much research has been carried out to address this 
problem. The property table approach tries to discover correlations among 
predicates and stores the related data in the same table, so that queries can 
be processed just as in a relational database. The column store approach 
focuses on each individual predicate: it partitions the RDF data into different 
tables based on the corresponding predicates and builds indices for each table. 
RDF-3X, a RISC-style engine for managing RDF data efficiently, keeps the 
original triple format, stores the triples directly, and builds indices over 
all possible permutations.
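
As a rough illustration of two of the storage layouts mentioned above, the 
following Python sketch partitions triples by predicate (column-store style) 
and builds one sorted copy of the triples per ordering of subject, predicate 
and object (the permutation-index idea). The toy triples and the function names 
are assumptions made for this example; they are not taken from the thesis or 
from any existing system.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical toy triples (subject, predicate, object); not from the thesis.
triples = [
    ("alice", "knows", "bob"),
    ("alice", "worksAt", "hkust"),
    ("bob", "worksAt", "hkust"),
]


def partition_by_predicate(triples):
    """Column-store style layout: one (subject, object) table per predicate."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    # Sorting each table stands in for building an index on it.
    return {p: sorted(rows) for p, rows in tables.items()}


def permutation_indices(triples):
    """One sorted copy of the triples for each ordering of S, P and O."""
    indices = {}
    for order in permutations(range(3)):
        name = "".join("SPO"[i] for i in order)
        indices[name] = sorted(tuple(t[i] for i in order) for t in triples)
    return indices


tables = partition_by_predicate(triples)
indices = permutation_indices(triples)
```

In this sketch a lookup on a fixed predicate scans a single small table, while 
the six sorted orderings let any combination of bound positions be answered by 
a range scan on a matching prefix.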

In this thesis, we go a step further to uncover properties of RDF data and 
make full use of them to process queries efficiently. More specifically: 
1) We introduce two linkage structures, star linkage and chain linkage. We 
extract this structural information, store it separately, and build aggregated 
indices on it. 2) Data that does not contain structural information is stored 
in different tables based on the predicate, similar to a column store. The key 
difference from a column store is that we do not treat all predicates equally. 
We observe that some predicates are multi-valued. For such predicates, instead 
of a B+ tree index, we use a local bitmap index, which is more suitable and 
improves query performance. 3) To obtain a high-performance query plan, we 
introduce a more complex and more accurate selectivity estimation that incurs 
no extra time cost compared with traditional estimation. We evaluate our 
approach on two RDF datasets, Billion Triple Challenge and YAGO, using queries 
of different kinds. Compared with RDF-3X and MonetDB, our approach performs 
better, especially for queries with star linkage or chain linkage information.
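
The abstract does not define the linkage structures or the bitmap index in 
detail, so the Python sketch below rests on assumptions: star linkage is read 
as triples sharing a subject, chain linkage as the object of one triple being 
the subject of another, and the bitmap index as one bitmap per object value of 
a multi-valued predicate. All data and function names are hypothetical and do 
not describe the thesis implementation.

```python
from collections import defaultdict

# Hypothetical toy triples; names are illustrative only.
triples = [
    ("p1", "author", "ann"),
    ("p1", "author", "bob"),
    ("p2", "author", "bob"),
    ("p1", "year", "2011"),
    ("ann", "worksAt", "hkust"),
]


def star_groups(triples):
    """Assumed reading of star linkage: triples that share a subject."""
    stars = defaultdict(list)
    for s, p, o in triples:
        stars[s].append((p, o))
    return {s: po for s, po in stars.items() if len(po) > 1}


def chain_pairs(triples):
    """Assumed reading of chain linkage: an object that is another subject."""
    by_subject = defaultdict(list)
    for t in triples:
        by_subject[t[0]].append(t)
    return [(t, u) for t in triples for u in by_subject.get(t[2], [])]


def bitmap_index(triples, predicate):
    """One bitmap per object value; bit i is set if subject i has the value."""
    subjects = sorted({s for s, p, _ in triples if p == predicate})
    position = {s: i for i, s in enumerate(subjects)}
    bitmaps = defaultdict(int)
    for s, p, o in triples:
        if p == predicate:
            bitmaps[o] |= 1 << position[s]
    return subjects, dict(bitmaps)


def subjects_with_all(subjects, bitmaps, values):
    """AND the bitmaps to find subjects that hold every listed value."""
    mask = (1 << len(subjects)) - 1
    for v in values:
        mask &= bitmaps.get(v, 0)
    return [s for i, s in enumerate(subjects) if (mask >> i) & 1]


subjects, bitmaps = bitmap_index(triples, "author")
both = subjects_with_all(subjects, bitmaps, ["ann", "bob"])
```

Under these assumptions, a query asking for subjects that carry several values 
of the same predicate reduces to a bitwise AND, which is the usual argument for 
preferring bitmaps over a tree index on multi-valued attributes.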


Date:			Wednesday, 24 August 2011

Time:			2:00pm – 4:00pm

Venue:			Room 5509
 			Lifts 25/26

Committee Members:	Dr. Lei Chen (Supervisor)
 			Dr. Ke Yi (Chairperson)
 			Dr. Charles Zhang


**** ALL are Welcome ****