The Hong Kong University of Science and Technology
Department of Computer Science


PhD Thesis Defence


"Supporting Efficient and Scalable Frequent Pattern Mining"

By

Miss Guimei Liu


Abstract

With large amount of data collected in various applications, data mining
has become an essential tool for data analysis and decision support.
Frequent itemset mining is an important problem in data mining area with a
wide range of applications. In this dissertation, we investigate several
techniques to support efficient and scalable frequent itemset mining.

We first identify the key factors of a frequent itemset mining algorithm,
and propose an algorithm AFOPT for efficient mining of frequent itemsets.
The AFOPT algorithm adopts the pattern growth approach. It uses a new
structure, called ascending frequency ordered prefix-trie (or AFOPT for
short), and two other structures to store conditional databases. An
adaptive strategy is employed to choose proper structures for conditional
databases according to the density of the conditional databases, which
makes AFOPT algorithm perform well on both sparse and dense databases.

To deal with very large databases with millions of transactions, we
propose a Search Space based Partitioning approach SSP for scalable
out-of-core mining. The novel idea of SSP is to partition the database
based on search space. Different partitions share data but not frequent
itemsets. As such frequent itemsets can be mined from each partition
without the need to scan the whole database to verify their supports.
Furthermore, techniques are developed so that available memory can be
utilized during the mining process to minimize disk I/O. Our experiment
results show that the proposed algorithms achieve significant performance
improvement over existing out-of-core algorithms.

Many decision support systems need to support online interactive frequent
itemset mining, which is very challenging because frequent itemset mining
is a computation intensive repeating process. We propose a compact
disk-based data structure---CFP-tree (Condensed Frequent Pattern tree) for
storing pre-computed frequent itemsets to support online mining requests.
Efficient algorithms for retrieving frequent itemsets and association
rules from CFP-tree, as well as efficient algorithms to construct and
update CFP-tree, are developed. One obstacle to online association rule
mining is the undesirably large result size. We introduce the concept of
frequent itemset class and class association rule to reduce result size
without information loss. The benefits of organizing itemsets into classes
are two-fold. On one hand, the output size is reduced and it is easier for
users to browse the results. On the other hand, the computation cost is
saved because the rule generation cost of the itemsets in the same class
is shared. Our performance study demonstrate that with CFP-tree and the
concept of frequent itemset class, frequent itemset and association rule
mining requests can be responded promptly.


Date:			Thursday, 5 May 2005

Time:			2:00p.m.-4:00p.m.

Venue:			Room 2404
			Lifts 17-18

Chairman:		Prof. Allen Moy (MATH)

Committee Members:	Prof. Lionel Ni (Supervisor)
                 	Prof. Frederick H. Lochovsky
                        Prof. Qiang Yang
                        Prof. Fugee Tsung (IEEM)
                        Prof. Jeffrey Xu Yu (Sys. Engg. & Engg. Mgmt., CUHK)


**** ALL are Welcome ****