An Adaptive Framework for Searching XML Documents

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "An Adaptive Framework for Searching XML Documents"

By

Mr. Ho-Lam Lau


Abstract

The evolution of computing technology suggests that it has become more
feasible to offer access to Web information in a ubiquitous way, through
various kinds of interaction devices such as PCs, laptops, palmtops, and
so on. As XML has become a de facto standard for exchanging Web data, an
interesting and practical research problem is the development of models
and techniques to satisfy various needs and preferences in searching XML
data.

In this thesis, we employ a list of simple XML-tagged keywords as a
vehicle for searching XML fragments in a collection of XML documents. In
order to deal with the diversified nature of XML documents as well as user
preferences, we propose a novel Multi-Ranker Model (MRM), which is able to
abstract a spectrum of important XML properties and adapt the features to
different XML search needs.

The MRM is composed of three ranking levels. The lowest level consists of
two categories of features: similarity and granularity. At the intermediate
level, we define four tailored XML Rankers (XRs), which consist of
different lower-level features and have different strengths in searching
XML fragments. The XRs are trained via a learning mechanism called the
Ranking Support Vector Machine in a voting Spy Naïve Bayes Framework
(RSSF). The RSSF takes as input a set of labeled fragments and feature
vectors and generates as output Adaptive Rankers (ARs) in the learning
process. The ARs are defined over the XRs and generated at the top level
of the MRM.

We show empirically that the RSSF is able to improve the MRM significantly
through a learning process that needs only a small set of training XML
fragments. We demonstrate that the trained MRM is able to bring out the
strengths of the XRs in order to adapt to different preferences and queries.

We also present the Adaptive Information Merging Approach (AIM) to merge
the XML fragments returned from the ranked result list. We incorporate user
feedback in order to further improve the coverage and specificity of the
merged results, which are measured in terms of two formal notions:
Information Completeness (IC) and Data Complexity (DC). IC represents
source coverage and measures the "completeness" of the information sources
involved, while DC represents the "richness" of the data and measures the
complexity of the retrieved data items.


Date:                   Friday, 5 October 2007

Time:                   2:00 p.m. - 4:00 p.m.

Venue:                  Room 3301
                        Lifts 17-18

Chairman:               Prof. Kani Chen (MATH)

Committee Members:      Prof. Wilfred Ng (Supervisor)
                        Prof. Frederick Lochovsky
                        Prof. Vincent Shen
                        Prof. Andrew Lim (IELM)
                        Prof. Jeffrey Xu Yu (Sys.Engg. & Engg.Mgmt.,CUHK)


**** ALL are Welcome ****