Duplicate Detection in XML Web Data

MPhil Thesis Defence


Title: "Duplicate Detection in XML Web Data"

By

Mr. Yuzhou Huang


Abstract

Duplicate entities are common on the Web, where structured XML data are
increasingly prevalent. Duplicate detection, an important data cleaning
task, consists of identifying different representations of the same
real-world object. Detecting and resolving duplicate entities benefits Web
users, so algorithms for detecting duplicates are needed to improve the
quality of Web data.

In this thesis, we present a feature-dependent algorithm that efficiently
identifies duplicates in XML Web data. First, we generate features that are
related to the targeted duplicates. Then, based on the generated features, we
construct a function for measuring similarity. A threshold is then used to
decide whether the candidate duplicates are real duplicates. We also introduce
a further step, similarity function learning, to improve the duplicate
detection results.
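
As a rough illustration of the pipeline described above (feature generation,
a feature-based similarity function, and a threshold), the following is a
minimal sketch and not the thesis's actual implementation. The sample XML, the
element names, the feature set, the weights, the threshold value, and the use
of a generic string-similarity ratio are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# Hypothetical sample data; element names are illustrative only.
XML = """
<catalog>
  <cd><title>Abbey Road</title><artist>The Beatles</artist><year>1969</year></cd>
  <cd><title>Abbey Rd.</title><artist>Beatles</artist><year>1969</year></cd>
  <cd><title>Kind of Blue</title><artist>Miles Davis</artist><year>1959</year></cd>
</catalog>
"""

# Assumed feature set, weights, and threshold (not taken from the thesis).
FEATURES = ("title", "artist", "year")
WEIGHTS = {"title": 0.5, "artist": 0.3, "year": 0.2}
THRESHOLD = 0.75


def extract_features(elem):
    """Map one XML element to a dict of normalised feature strings."""
    return {f: (elem.findtext(f) or "").strip().lower() for f in FEATURES}


def similarity(a, b):
    """Weighted average of per-feature string similarity.

    Features missing from either record are skipped, so incomplete
    records can still be compared on the features they share.
    """
    total = 0.0
    weight_sum = 0.0
    for f in FEATURES:
        if a[f] and b[f]:
            total += WEIGHTS[f] * SequenceMatcher(None, a[f], b[f]).ratio()
            weight_sum += WEIGHTS[f]
    return total / weight_sum if weight_sum else 0.0


def find_duplicates(xml_text, threshold=THRESHOLD):
    """Return index pairs of <cd> records whose similarity meets the threshold."""
    records = [extract_features(e) for e in ET.fromstring(xml_text).iter("cd")]
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    print(find_duplicates(XML))
```

In this sketch the weights are fixed by hand; a similarity function learning
step would presumably adjust them (and possibly the threshold) from labelled
duplicate and non-duplicate pairs.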

To show that this methodology can be broadly applied, we apply the
algorithm to different kinds of XML Web data that are readily available on
websites. We also target various entity types in the experiments, such as
CD name entities and author entities. Moreover, we manually generate dirty
data to show that our algorithm works well even when the datasets contain
errors or missing information.


Date:			Monday, 4 May 2009

Time:			3:00pm – 5:00pm

Venue:			Room 3501
 			Lifts 25-26

Committee Members:	Prof. Frederick Lochovsky (Supervisor)
 			Dr. Wilfred Ng (Chairperson)
 			Dr. Lei Chen


**** ALL are Welcome ****