PhD Thesis Proposal Defence


Title: "Latent Tree Analysis for Hierarchical Topic Detection: Scalability 
and Count Data"

by

Miss Peixian CHEN


Abstract:

Detecting topics and topic hierarchies from large archives of documents
has been one of the most active research areas in the last decade. The
objective of topic detection is to discover the thematic structure
underlying document collections, based on which the collections can be
organized and summarized. Recently, hierarchical latent tree analysis
(HLTA) was proposed as a new method for topic detection. It differs
fundamentally from the currently predominant topic detection approach,
latent Dirichlet allocation (LDA), in terms of topic definition,
topic-document relationship, and learning method. HLTA uses a class of
graphical models called hierarchical latent tree models (HLTMs) to
build a topic hierarchy. The variables at the bottom level of an HLTM
are binary observed variables that represent the presence/absence of
words in a document. The variables at the other levels are binary
latent variables, with those at the lowest latent level representing
word co-occurrence patterns and those at higher levels representing
co-occurrences of patterns at the level below. Each latent variable
gives a soft partition of the documents, and the document clusters in
the partitions are interpreted as topics.
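
To make this concrete, below is a minimal sketch in Python of the
binary presence/absence representation that the observed variables at
the bottom of an HLTM are defined over. It is a toy illustration with
made-up documents, not code from the thesis.

    # Toy illustration (not code from the thesis): the observed
    # variables at the bottom of an HLTM encode the presence/absence
    # of words in a document.
    docs = [
        "latent tree models for topic detection",
        "topic hierarchies from document collections",
    ]

    # Toy vocabulary: the distinct words in the corpus.
    vocab = sorted({w for d in docs for w in d.split()})

    def to_binary_vector(doc, vocab):
        """Return 1 for each word present in the document, 0 otherwise."""
        words = set(doc.split())
        return [1 if w in words else 0 for w in vocab]

    for d in docs:
        print(to_binary_vector(d, vocab), "<-", d)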

HLTA has been shown to discover significantly more coherent topics and 
better topic hierarchies than LDA-based hierarchical topic detection 
methods on binary data. However, it has two shortcomings in its current 
form. First, it does not scale up well. It takes, for instance, 17 hours 
to process a NIPS dataset that consists of fewer than 2,000 documents over 
1,000 distinct words. Second, it operates on binary data and does not
take word frequencies into consideration. This leads to significant
information loss. In this thesis proposal, we propose and investigate
methods for overcoming these shortcomings.

First, we propose a new algorithm to scale up HLTA. The computational
bottleneck of the original HLTA lies in the use of the
Expectation-Maximization (EM) algorithm for parameter estimation during
model structure learning, which produces a large number of intermediate
models. Here we propose progressive EM (PEM) as a replacement for EM.
PEM is motivated by a spectral technique used in the method of moments,
which relates model parameters to population moments that involve at
most 3 observed variables. Similarly, PEM carries out parameter
estimation in submodels that involve 3 or 4 observed binary variables.
PEM is efficient because, however large a dataset is, it contains only
8 or 16 distinct cases when projected onto 3 or 4 binary variables. The
new algorithm is hence named PEM-HLTA. To estimate the parameters of
the final model, we use stepwise EM, which operates in a way similar to
stochastic gradient descent. PEM-HLTA finishes processing the
aforementioned NIPS data within 4 minutes in the same computing
environment and is capable of analyzing much larger corpora.
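
The following sketch illustrates why parameter estimation in such
submodels is cheap. The data and variable indices are hypothetical and
this is not the authors' implementation; the point is that a binary
dataset, projected onto 3 variables, can contain at most 2^3 = 8
distinct cases, so the sufficient statistics stay tiny no matter how
many documents there are.

    from collections import Counter
    import random

    # Simulate a large binary document-word matrix (hypothetical data).
    random.seed(0)
    n_docs, n_words = 10000, 50
    data = [[random.random() < 0.1 for _ in range(n_words)]
            for _ in range(n_docs)]

    # Indices of the 3 observed variables in one submodel (made up).
    submodel_vars = (3, 17, 42)

    # Tally the distinct projected cases: at most 2^3 = 8 of them,
    # regardless of the number of documents.
    counts = Counter(tuple(row[j] for j in submodel_vars)
                     for row in data)
    print(len(counts), "distinct cases (at most 8)")

Each estimation step over the submodel then works with these few
weighted cases rather than with a pass over the full dataset.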

Second, we propose to incorporate word frequencies into HLTA.
Currently, HLTA models documents as binary vectors. Binary
representations capture word co-occurrences but reflect little about
word proportions in a document. Two documents that use the same set of
words, but with different frequencies, might be about completely
different topics. We therefore propose an extension, HLTA for
bag-of-words data (HLTA-bow). HLTA-bow replaces the binary observed
variables in current HLTMs with continuous variables, each of which
follows a mixture of Gaussian distributions truncated to the interval
[0,1]. These continuous observed variables represent the relative
frequencies of words in a document. HLTA-bow is hence capable of
modeling word frequency distributions under different topics, which
reflects the usage patterns of words rather than mere co-occurrences.
Preliminary experiments demonstrate that HLTA-bow produces models with
much better predictive performance than LDA-based methods on
bag-of-words data.
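
As a toy illustration of the two ingredients just described (again, not
the authors' implementation): the mapping from a document to relative
word frequencies in [0,1], and the density of a single Gaussian
component truncated to [0,1]. A mixture would combine several such
components, e.g. one per state of the parent latent variable.

    import math

    def relative_frequencies(doc, vocab):
        """Map a document to the [0,1]-valued observations of HLTA-bow."""
        words = doc.split()
        return [words.count(w) / len(words) for w in vocab]

    def truncated_gaussian_pdf(x, mu, sigma, lo=0.0, hi=1.0):
        """Density of N(mu, sigma^2) truncated to the interval [lo, hi]."""
        phi = lambda z: math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
        Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2)))
        mass = Phi((hi - mu) / sigma) - Phi((lo - mu) / sigma)
        return phi((x - mu) / sigma) / (sigma * mass)

    doc = "topic model topic tree topic"
    vocab = ["topic", "model", "tree"]
    print(relative_frequencies(doc, vocab))   # [0.6, 0.2, 0.2]
    print(truncated_gaussian_pdf(0.6, mu=0.5, sigma=0.2))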


Date:                   Wednesday, 26 April 2017

Time:                   1:30pm - 3:30pm

Venue:                  Room 2463 (lifts 25/26)

Committee Members:      Prof. Nevin Zhang (Supervisor)
                        Dr. Raymond Wong (Chairperson)
                        Prof. Fangzhen Lin
                        Dr. Yangqiu Song


**** ALL are Welcome ****