Accelerated Learning of Latent Tree Models for Topic Detection and Multidimensional Clustering

PhD Thesis Proposal Defence


Title: "Accelerated Learning of Latent Tree Models for Topic Detection 
and Multidimensional Clustering"

by

Mr. Tengfei LIU


Abstract:

Latent tree models (LTMs) are tree-structured probabilistic graphical 
models where the variables at leaf nodes are observed and the variables at 
internal nodes are latent. They represent complex relationships among 
observed variables and yet are computationally simple to work with. When 
used for data analysis, they can identify meaningful groupings of observed 
variables, where the correlations among the variables in each group can be 
modeled by a single latent variable. As such, LTMs are effective tools for 
detecting correlation patterns in data and can be applied to various 
domains. In this thesis, we focus on developing new algorithms for 
learning latent tree models and investigating their use in two 
applications.

Firstly, we propose a new method called Bridged-Islands (BI) algorithm for 
learning latent tree models. Many algorithms have been proposed for the 
same task. However, previous methods for learning latent tree models have 
their limitations. For example, the search algorithms can only handle data 
sets with dozens of attributes which limits their use in processing large 
data set. The algorithms based on attribute clustering (AC) usually focus 
on latent tree models with special structures (e.g., binary trees or 
forests) which may be not appropriate for many real world applications. 
The algorithms motivated by phylogenetic tree reconstruction (PTR) have 
restrictions on both observed and latent variables. They normally require 
that all observed or latent variables have the same number of states and 
the number should be known beforehand. Compared with AC-based algorithms 
and PTR-motivated algorithms, the advantage of BI algorithm is model 
quality. It produces models with better quality and has no restrictions on 
the cardinalities of the observed variables or latent variables. Compared 
with search algorithms, BI is more efficient and can be employed in larger 
data sets.

Secondly, we propose a new LTM-based algorithm called hierarchical latent 
tree analysis (HLTA) for identifying topics from a collection of 
documents. The predominant method for topic detection are latent Dirichlet 
allocation (LDA), which assume a generating process for the documents. 
HLTA provides an alternative view for topic detection. It identifies the 
topics by simultaneously clustering the documents and word variables. In 
this new view, a topic is considered as a class of documents that show 
clear thematic meaning. The new method models the word co-occurrence 
patterns by using a hierarchical latent tree model. Each latent variable 
represents a soft partition of the documents based on some word 
co-occurrence patterns. Each state of the latent variable, which 
represents a cluster of documents in the partition, is interpreted as a 
topic. In the hierarchical structure, the variables at a higher level 
correspond to more general topics, while the variables at a lower level 
correspond to more specific topics. Compared with alternative methods, 
HLTA integrates topic correlation modeling, topic structure detection and 
automatic determination of topic numbers in one framework. Empirical 
experiments also show that the new algorithm produces more semantically 
coherent topics and topic hierarchy than LDA-based methods.

Thirdly, we explore the use of LTMs for multidimensional clustering. 
Conventional clustering usually assume that there is only one true 
clustering within a data set, which does not hold for many real-world data 
that are multifaceted. Multidimensional clustering aims to identify 
multiple clusterings from the data. Latent tree models have previously 
been used for multidimensional clustering, where each latent variable in 
the model represents one way to cluster the data. In this thesis, we apply 
the new algorithm to the task and simultaneously deal with much larger 
data sets than previously. We compared LTM-based methods with several 
alternative algorithms for multidimensional clustering on both labeled and 
unlabeled data sets. The empirical experiments indicate that the LTM-based 
methods can identify a rich set of meaningful clusterings and concurrently 
achieve good performance.


Date:			Monday, 1 December 2014

Time:                   9:30am - 11:30am

Venue:                  Room 3501
                        lifts 25/26

Committee Members:	Prof. Nevin Zhang (Supervisor)
 			Prof. Dit-Yan Yeung (Chairperson)
 			Dr. Huamin Qu
 			Dr. Raymond Wong


**** ALL are Welcome ****