Accelerated Learning of Latent Tree Models for Topic Detection and Multidimensional Clustering

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Accelerated Learning of Latent Tree Models for Topic Detection and 
Multidimensional Clustering"

By

Mr. Tengfei LIU


Abstract

Latent tree models (LTMs) are tree-structured probabilistic graphical 
models where the variables at leaf nodes are observed and the variables at 
internal nodes are latent. They represent complex relationships among 
observed variables and yet are computationally simple to work with.  When 
used for data analysis, they can identify meaningful groupings of observed 
variables, where the correlations among the variables in each group can be 
modeled by a single latent variable. As such, LTMs are effective tools for 
detecting correlation patterns in data and can be applied to various 
domains. In this thesis, we focus on developing new algorithms for 
learning latent tree models and investigating their use in two 
applications.

Firstly, we propose a new method called Bridged-Islands (BI) algorithm for 
learning latent tree models. Many algorithms have been proposed for the 
same task. However, previous methods for learning latent tree models have 
their limitations. For example, the search algorithms can only handle data 
sets with dozens of attributes. The algorithms based on attribute 
clustering (AC) usually focus on latent tree models with special 
structures (e.g., binary trees or forests) which may be not appropriate 
for many real world applications. The algorithms motivated by phylogenetic 
tree reconstruction (PTR) have restrictions on both observed and latent 
variables. They normally require that all observed or latent variables 
have the same number of states and the number should be known beforehand. 
Compared with AC-based algorithms and PTR-motivated algorithms, the 
advantage of BI algorithm is model quality. It produces models with better 
quality and has no restrictions on the cardinalities of the observed 
variables or latent variables. Compared with search algorithms, BI is more 
efficient and can be employed in larger data sets.

Secondly, we propose a new LTM-based algorithm called hierarchical latent 
tree analysis (HLTA) for identifying topics from a collection of 
documents. The predominant method for topic detection is latent Dirichlet 
allocation (LDA), which assumes a generating process for the documents. 
HLTA provides an alternative view for topic detection. It identifies the 
topics by simultaneously clustering the documents and word variables. In 
this new view, a topic is considered as a class of documents that show 
clear thematic meaning. The new method models the word co-occurrence 
patterns by using a hierarchical latent tree model. Each latent variable 
represents a soft partition of the documents based on some word 
co-occurrence patterns. In the hierarchical structure, the variables at a 
higher level correspond to more general topics, while the variables at a 
lower level correspond to more specific topics. Compared with alternative 
methods, HLTA integrates topic correlation modeling, topic structure 
detection and automatic determination of topic numbers in one framework. 
Empirical experiments also show that the new algorithm produces more 
semantically coherent topics and topic hierarchy than LDA-based methods.

Thirdly, we explore the use of LTMs for multidimensional clustering. 
Conventional clustering usually assumes that there is only one true 
clustering within a data set, which does not hold for many real-world data 
that are multifaceted. Multidimensional clustering aims to identify 
multiple clusterings from the data. Latent tree models have previously 
been used for multidimensional clustering, where each latent variable in 
the model represents one way to cluster the data. In this thesis, we apply 
the new algorithm to the task and simultaneously deal with much larger 
data sets than previously. We compared LTM-based methods with several 
alternative algorithms for multidimensional clustering on both labeled and 
unlabeled data sets. The empirical experiments indicate that the LTM-based 
methods can identify a rich set of meaningful clusterings and concurrently 
achieve good performance.


Date:			Monday, 9 March 2015

Time:			3:00pm - 5:00pm

Venue:			Room 5503
 			Lifts 25/26

Chairman:		Prof. Philip Mok (ECE)

Committee Members:	Prof. Nevin Zhang (Supervisor)
 			Prof. Raymond Wong
 			Prof. Dit-Yan Yeung
 			Prof. Weichuan Yu (ECE)
 			Prof. Joydeep Ghosh (Elec. & Comp. Eng., U. of Texas)


**** ALL are Welcome ****