Efficient Keyword Search in Archival Collections

Speaker:	Dr. Torsten SUEL
		Principal Research Scientist at Yahoo! Research
		and
		Associate Professor
		Department of Computer and Information Science
		Polytechnic University, Brooklyn, NY

Title:		"Efficient Keyword Search in Archival Collections"

Date:		Monday, 28 April 2008

Time:		4:00pm - 5:00pm

Venue:		Lecture Theatre F
		(Leung Yat Sing Lecture Theatre, near lift nos. 25/26)
		HKUST

Abstract:

Current web search engines focus on searching only the most recent
snapshot of the web. In many cases, however, it would be desirable to
search over collections that include many different crawls and thus many
different versions of each document. Important examples are the Internet
Archive, which has collected multiple snapshots of the web since 1995;
Wikipedia, which keeps track of all versions of each article; and
versioning file systems and revision control systems. Since such archival
collections are often much larger than the latest snapshot alone,
searching them poses significant performance challenges. Current search
engines use many techniques for index compression and optimized query
execution, but these techniques do not exploit the significant
similarities between different versions of a document, or between related
documents.

In this talk, we discuss challenges and research issues in searching and
mining archival text collections. We then propose a framework for indexing
and query processing in archival collections and, more generally, any
collections with a sufficient amount of similarity between documents or
versions. This approach results in significant reductions in index size
and query processing costs on such collections, and it is orthogonal to
existing techniques and can be combined with them. It also supports highly
efficient updates, both locally and over a network. We present
experimental results based on general web crawls and Wikipedia data.

[This is joint work with Jiangong Zhang]


************************
Biography:

Torsten SUEL is a Principal Research Scientist at Yahoo! Research, and an
Associate Professor in the Department of Computer and Information Science
at Polytechnic University in Brooklyn, NY. He received a Diplom degree
from the Technical University of Braunschweig (Germany), and a Ph.D. from
the University of Texas at Austin. After postdoctoral research at the NEC
Research Institute, UC Berkeley, and Bell Labs, he joined Polytechnic
University in the fall of 1998. His main research interests are in the
areas of web search engines and web data mining, algorithms, databases,
and distributed systems.