The Hong Kong University of Science and Technology Department of Computer Science PhD Thesis Defence "Supporting Efficient and Scalable Frequent Pattern Mining" By Miss Guimei Liu Abstract With large amount of data collected in various applications, data mining has become an essential tool for data analysis and decision support. Frequent itemset mining is an important problem in data mining area with a wide range of applications. In this dissertation, we investigate several techniques to support efficient and scalable frequent itemset mining. We first identify the key factors of a frequent itemset mining algorithm, and propose an algorithm AFOPT for efficient mining of frequent itemsets. The AFOPT algorithm adopts the pattern growth approach. It uses a new structure, called ascending frequency ordered prefix-trie (or AFOPT for short), and two other structures to store conditional databases. An adaptive strategy is employed to choose proper structures for conditional databases according to the density of the conditional databases, which makes AFOPT algorithm perform well on both sparse and dense databases. To deal with very large databases with millions of transactions, we propose a Search Space based Partitioning approach SSP for scalable out-of-core mining. The novel idea of SSP is to partition the database based on search space. Different partitions share data but not frequent itemsets. As such frequent itemsets can be mined from each partition without the need to scan the whole database to verify their supports. Furthermore, techniques are developed so that available memory can be utilized during the mining process to minimize disk I/O. Our experiment results show that the proposed algorithms achieve significant performance improvement over existing out-of-core algorithms. Many decision support systems need to support online interactive frequent itemset mining, which is very challenging because frequent itemset mining is a computation intensive repeating process. We propose a compact disk-based data structure---CFP-tree (Condensed Frequent Pattern tree) for storing pre-computed frequent itemsets to support online mining requests. Efficient algorithms for retrieving frequent itemsets and association rules from CFP-tree, as well as efficient algorithms to construct and update CFP-tree, are developed. One obstacle to online association rule mining is the undesirably large result size. We introduce the concept of frequent itemset class and class association rule to reduce result size without information loss. The benefits of organizing itemsets into classes are two-fold. On one hand, the output size is reduced and it is easier for users to browse the results. On the other hand, the computation cost is saved because the rule generation cost of the itemsets in the same class is shared. Our performance study demonstrate that with CFP-tree and the concept of frequent itemset class, frequent itemset and association rule mining requests can be responded promptly. Date: Thursday, 5 May 2005 Time: 2:00p.m.-4:00p.m. Venue: Room 2404 Lifts 17-18 Chairman: Prof. Allen Moy (MATH) Committee Members: Prof. Lionel Ni (Supervisor) Prof. Frederick H. Lochovsky Prof. Qiang Yang Prof. Fugee Tsung (IEEM) Prof. Jeffrey Xu Yu (Sys. Engg. & Engg. Mgmt., CUHK) **** ALL are Welcome ****