Latent Tree Analysis for Hierarchical Topic Detection: Scalability and Count Data

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Latent Tree Analysis for Hierarchical Topic Detection: Scalability and 
Count Data"

By

Miss Peixian CHEN


Abstract

Detecting topics and topic hierarchies from large archives of documents has 
been one of the most active research areas in the last decade. The objective of 
topic detection is to discover the thematic structure underlying document 
collections, based on which the collections can be organized and summarized. 
Recently, hierarchical latent tree analysis (HLTA) has been proposed as a new method 
for topic detection. It uses a class of graphical models called hierarchical 
latent tree models (HLTMs) to build a topic hierarchy. The variables at the 
bottom level of an HLTM are binary observed variables that represent the 
presence/absence of words in a document. The variables at other levels are 
binary latent variables that represent word co-occurrence patterns with 
different granularities. Each latent variable gives a soft partition of the 
documents, and document clusters in the partitions are interpreted as topics.
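
For illustration, the sketch below (in Python) shows how a single latent 
variable in such a model softly partitions documents: given a document's 
word-presence vector, the posterior probability of the latent state is 
computed. All names, word lists and probabilities here are hypothetical and 
serve only to illustrate the idea; they are not parameters from the thesis.

import numpy as np

# Toy fragment of an HLTM: one binary latent variable Z with three
# binary word-presence children. All numbers are made up for illustration.
words = ["game", "team", "score"]
p_z1 = 0.3                                   # prior P(Z = 1)
p_w_given_z1 = np.array([0.60, 0.55, 0.40])  # P(word present | Z = 1)
p_w_given_z0 = np.array([0.05, 0.06, 0.03])  # P(word present | Z = 0)

def posterior_z1(doc):
    """Soft assignment P(Z = 1 | doc) for a 0/1 word-presence vector."""
    doc = np.asarray(doc)
    lik1 = np.prod(p_w_given_z1**doc * (1 - p_w_given_z1)**(1 - doc))
    lik0 = np.prod(p_w_given_z0**doc * (1 - p_w_given_z0)**(1 - doc))
    joint1, joint0 = p_z1 * lik1, (1 - p_z1) * lik0
    return joint1 / (joint1 + joint0)

print(posterior_z1([1, 1, 0]))  # document mentioning "game" and "team"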

HLTA has been shown to discover significantly better models and more coherent 
topics and topic hierarchies than state-of-the-art LDA-based hierarchical 
topic detection methods. However, HLTA in its current form can hardly be 
regarded as a practical topic detection tool. First, it has a rather 
prohibitive computational cost; second, it operates only on binary data. In 
this thesis, we propose and investigate methods to overcome these shortcomings.

First, we propose a new learning algorithm, PEM-HLTA, to scale up HLTA. HLTA 
consists of two phases: a model construction phase and a parameter estimation 
phase. The computational bottleneck of HLTA lies in the use of the EM algorithm 
for estimating parameters during the model construction phase, which produces a 
large number of intermediate models. Here we propose progressive EM (PEM) as a 
replacement for EM. PEM carries out parameter estimation in submodels that 
involve only 3 or 4 observed binary variables, yielding a large speed-up. 
Combined with acceleration techniques applied to the parameter estimation 
phase, PEM-HLTA is capable of analyzing much larger corpora with hundreds 
of thousands of documents.
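
The sketch below illustrates the kind of small-submodel computation that PEM 
relies on: a standard EM run for a single binary latent variable with three 
observed binary word variables. It is a generic illustration on assumed toy 
data; the actual submodel selection and parameter-transfer steps of PEM-HLTA 
are not reproduced here.

import numpy as np

# EM on a tiny submodel: one binary latent variable Z and three observed
# binary word variables. PEM's key idea is that such small submodels are
# cheap to fit.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 3))    # toy 0/1 data, one row per document

p_z1 = 0.5                                # initial P(Z = 1)
p_w = rng.uniform(0.2, 0.8, size=(2, 3))  # P(word present | Z = z), rows z = 0, 1

for _ in range(50):
    # E-step: posterior responsibility P(Z = 1 | document).
    lik1 = np.prod(p_w[1]**X * (1 - p_w[1])**(1 - X), axis=1)
    lik0 = np.prod(p_w[0]**X * (1 - p_w[0])**(1 - X), axis=1)
    r = p_z1 * lik1 / (p_z1 * lik1 + (1 - p_z1) * lik0)

    # M-step: re-estimate the prior and the conditional word probabilities.
    p_z1 = r.mean()
    p_w[1] = (r[:, None] * X).sum(axis=0) / r.sum()
    p_w[0] = ((1 - r)[:, None] * X).sum(axis=0) / (1 - r).sum()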

Second, we propose an extension, HLTA-c, to incorporate word counts into 
PEM-HLTA. The inability to deal with count data has put HLTA at a 
disadvantage as a topic detection method. We introduce real-valued continuous 
variables to replace the observed binary variables in HLTMs. This is done in 
the parameter estimation phase and allows PEM-HLTA to model word frequency 
distributions under different topics, reflecting the usage patterns of 
words rather than pure word co-occurrences.
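
As an illustration of this change, the sketch below replaces the Bernoulli 
presence/absence model with a per-state distribution over a continuous 
frequency value. A Gaussian over log(1 + count) is assumed here purely for 
illustration; it is not necessarily the parameterization adopted in HLTA-c.

import numpy as np

# Instead of a Bernoulli over presence/absence, each latent state carries
# a distribution over a continuous word-frequency value. The Gaussian over
# log(1 + count) below is an assumption made for illustration only.
count = 7                       # hypothetical count of a word in a document
x = np.log1p(count)             # continuous observed value

p_z1 = 0.3                      # prior P(Z = 1)
mu = np.array([0.2, 1.5])       # per-state means, for z = 0 and z = 1
sigma = np.array([0.5, 0.8])    # per-state standard deviations

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

joint1 = p_z1 * normal_pdf(x, mu[1], sigma[1])
joint0 = (1 - p_z1) * normal_pdf(x, mu[0], sigma[0])
print(joint1 / (joint1 + joint0))   # P(Z = 1 | observed frequency)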

With the aforementioned improvements in scalability and model flexibility, 
HLTA-c is a new state-of-the-art topic detection approach. Empirical 
results show that HLTA-c achieves efficiency comparable to that of the best 
LDA-based hierarchical topic detection methods, and excels in predictive 
performance, topic coherence and topic hierarchy quality.


Date:			Wednesday, 23 August 2017

Time:			2:00pm - 4:00pm

Venue:			Room 2612B
 			Lifts 31/32

Chairman:		Prof. Jeffrey Chasnov (MATH)

Committee Members:	Prof. Nevin Zhang (Supervisor)
 			Prof. Lei Chen
 			Prof. Wilfred Ng
 			Prof. Weichuan Yu (ECE)
 			Prof. Wai Lam (Sys Engg & Engg Mgmt, CUHK)


**** ALL are Welcome ****