PhD Thesis Proposal Defence


Title: "Unidimensional Clustering using Latent Tree Models"

by

Miss Hua LIU


Abstract:

Cluster analysis is an important research topic in machine learning and
data mining. The objective is to partition data into meaningful clusters,
where the number of clusters may be known or unknown. A variety of
approaches have been proposed for the task, including
distance/similarity-based algorithms such as K-means, kernel K-means and
spectral clustering, as well as model-based methods such as Gaussian
mixture models (GMMs). Most previous clustering algorithms target
continuous data, that is, points in a Euclidean space of some dimension.
In this thesis, we focus on discrete data, where each attribute takes a
finite number of possible values.

A commonly used method for clustering discrete data is latent class
analysis (LCA). LCA is based on a class of finite mixture models known as
latent class models (LCMs). An LCM consists of a latent variable and a
number of attributes, and it assumes that the attributes are mutually
independent given the latent variable. Known as local independence, this
assumption is often too strong in practice and might lead to spurious
clusters. Another method for clustering discrete data, called latent tree
analysis (LTA), is based on latent tree models (LTMs). LTMs allow
multiple latent variables, and hence the local independence assumption is
relaxed. Each latent variable represents a soft partition of the data, so
LTA yields multiple partitions of the data and is a multidimensional
clustering method. While multiple partitions are desirable in some
applications, there are many situations where one wants only a single
partition of the data.
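As a minimal sketch of how an LCM clusters under local independence, the
following Python fragment computes the joint probability and the soft
cluster assignment for one data point; the parameter values are made up
for illustration:

    import numpy as np

    # Hypothetical LCM: one latent variable Z with two classes and three
    # binary attributes. pi[z] = P(Z = z); theta[z, i] = P(X_i = 1 | Z = z).
    pi = np.array([0.6, 0.4])
    theta = np.array([[0.9, 0.8, 0.7],
                      [0.2, 0.3, 0.1]])

    def joint_prob(x):
        """P(x) = sum_z P(z) * prod_i P(x_i | z)  (local independence)."""
        per_class = np.where(x == 1, theta, 1.0 - theta).prod(axis=1)
        return float(pi @ per_class)

    def posterior(x):
        """Soft cluster assignment P(Z = z | x), i.e. a soft partition."""
        weighted = pi * np.where(x == 1, theta, 1.0 - theta).prod(axis=1)
        return weighted / weighted.sum()

    x = np.array([1, 1, 0])
    print(joint_prob(x))   # probability of x under the LCM
    print(posterior(x))    # how x is shared between the two clusters

Note that local independence enters through the product over attributes
within each class; when the attributes remain correlated inside a true
cluster, this factorization is violated, which is what can produce
spurious clusters.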

In this thesis, we develop and investigate a novel method for clustering
discrete data that yields a single partition of the data and does not
make the local independence assumption. The method learns, from data, a
three-level LTM where (1) the attributes are at the bottom level, (2)
there is a unique latent variable, the root, at the top level, and (3)
there are multiple latent variables at the middle level that connect the
attributes to the root. The root represents the data partition to be
obtained. It is not directly connected to the attributes, and hence the
local independence assumption is relaxed. The method is called UC-LTM,
which stands for unidimensional clustering based on latent tree models.
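The structure of such a three-level LTM can be sketched as follows; the
variable names (Y, Z1, Z2, X1, ...) and the grouping of attributes are
illustrative assumptions, not taken from the thesis:

    # Hedged sketch of a three-level LTM's structure: a single root Y,
    # middle latent variables, and groups of attributes at the bottom.
    three_level_ltm = {
        "root": "Y",                    # the single partition of interest
        "middle": {
            "Z1": ["X1", "X2", "X3"],   # Z1 absorbs dependence among X1..X3
            "Z2": ["X4", "X5"],         # Z2 absorbs dependence among X4, X5
        },
    }
    # Because Y reaches the attributes only through Z1 and Z2, the
    # attributes need not be mutually independent given Y alone; local
    # independence is required only within each group given its middle
    # latent variable.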

In addition to the general-purpose algorithm UC-LTM, we also propose and
investigate a special-purpose unidimensional clustering method for
solving the rounding problem of spectral clustering. Spectral clustering
starts with a similarity matrix for a collection of data points,
transforms the matrix to obtain the so-called Laplacian matrix, finds the
eigenvectors of the Laplacian matrix, and partitions the data points
using the leading eigenvectors. The last step is called rounding, and it
consists of three sub-problems: (1) decide how many leading eigenvectors
to use, (2) determine the number of clusters, and (3) partition the data
points. To solve the rounding problem, we treat the eigenvectors as
features, discretize them, and then learn a three-level LTM to cluster
the data points. Our method differs from previous rounding methods in
three ways. First, we relax the assumption that the number of clusters
equals the number of eigenvectors used. Second, when deciding how many
leading eigenvectors to use, we rely not only on information contained in
the leading eigenvectors themselves, but also on the subsequent
eigenvectors. Third, our method is model-based and solves all three
sub-problems simultaneously.
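To make the pipeline up to the rounding step concrete, here is a minimal
Python sketch; it assumes a symmetric similarity matrix S with positive
row sums and uses the symmetric normalized Laplacian, and the function
name is hypothetical:

    import numpy as np

    def leading_eigenvectors(S, m):
        """Return the m leading eigenvectors of the normalized Laplacian."""
        d = S.sum(axis=1)
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        # L_sym = I - D^{-1/2} S D^{-1/2}
        L = np.eye(len(S)) - D_inv_sqrt @ S @ D_inv_sqrt
        eigvals, eigvecs = np.linalg.eigh(L)   # ascending eigenvalues
        return eigvecs[:, :m]                  # smallest-eigenvalue vectors

    # Toy 4-point similarity matrix with two obvious groups.
    S = np.array([[1.0, 0.9, 0.1, 0.0],
                  [0.9, 1.0, 0.0, 0.1],
                  [0.1, 0.0, 1.0, 0.8],
                  [0.0, 0.1, 0.8, 1.0]])
    V = leading_eigenvectors(S, m=2)   # rows of V embed the 4 points

Rounding then partitions the rows of this n-by-m embedding; in the method
proposed here, the columns are discretized and a three-level LTM is
learned on them, rather than running K-means on the rows as in standard
spectral clustering.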


Date:               Tuesday, 19 May 2015

Time:               3:00pm - 5:00pm

Venue:              Room 2132C (lift 19)

Committee Members:  Prof. Nevin Zhang (Supervisor)
                    Prof. Fangzhen Lin (Chairperson)
                    Dr. Lei Chen
                    Dr. Wilfred Ng


**** ALL are Welcome ****