Multidimensional Text Clustering for Hierarchical Topic Detection
(† The Hong Kong University of Science and Technology
‡ The Education University of Hong Kong)
Text clustering is generally considered unsuitable for topic detection because it associates each document with only one "topic" (i.e., document cluster). Recent advances in model-based multidimensional clustering have overcome the difficulty, and have given rise to a novel approach to hierarchical topic detection that outperforms the LDA approach in empirical studies.
The new approach is called hierarchical latent tree analysis (HLTA). The idea is to model document collections using a class of graphical models called hierarchical latent tree models (HLTMs). The variables at the bottom level of an HLTM are observed binary variables that represent the presence/absence of words in a document. The variables at other levels are binary latent variables. The latent variables at the second level model word co-occurrence patterns, and those at higher levels model co-occurrences of patterns at the level below.
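As an illustration of the structure described above, the following minimal Python sketch builds the binary presence/absence representation of documents and computes the level of each variable in a small tree. The vocabulary, the latent variable names (Z11, Z12, Z21) and the tree itself are hypothetical and chosen only for illustration; this is not the HLTA implementation.

```python
from typing import Dict, List


def to_binary_vectors(docs: List[str], vocab: List[str]) -> List[Dict[str, int]]:
    """Map each document to word presence/absence indicators (1 = present)."""
    rows = []
    for text in docs:
        tokens = set(text.lower().split())
        rows.append({w: int(w in tokens) for w in vocab})
    return rows


def levels(parent: Dict[str, str]) -> Dict[str, int]:
    """Level 1 holds the observed word variables; a latent variable sits
    one level above its highest child."""
    children: Dict[str, List[str]] = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)

    memo: Dict[str, int] = {}

    def level_of(node: str) -> int:
        if node not in memo:
            kids = children.get(node, [])
            memo[node] = 1 if not kids else 1 + max(level_of(k) for k in kids)
        return memo[node]

    nodes = set(parent) | set(parent.values())
    return {n: level_of(n) for n in nodes}


# Hypothetical toy tree (child -> parent). Word variables are the leaves,
# Z11 and Z12 are second-level latent variables, Z21 is the root.
PARENT = {
    "nasa": "Z11", "space": "Z11", "shuttle": "Z11",
    "team": "Z12", "game": "Z12", "season": "Z12",
    "Z11": "Z21", "Z12": "Z21",
}

VOCAB = ["nasa", "space", "shuttle", "team", "game", "season"]
DOCS = ["NASA launched the shuttle", "The team won the game"]

vectors = to_binary_vectors(DOCS, VOCAB)
node_levels = levels(PARENT)
```

Here Z11 and Z12 would model co-occurrence patterns among their word children, and Z21 would model the co-occurrence of those two patterns.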
Each latent variable gives a partition of the documents, and the document clusters in the partitions are interpreted as topics. The topics at high levels of the hierarchy capture "long-range" word co-occurrences and hence are thematically more general, while the topics at low levels capture "short-range" word co-occurrences and hence are thematically more specific.
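To make the partition idea concrete, here is a simplified Python sketch in which each binary latent variable splits the documents into two clusters according to the posterior probability of its "on" state. As a simplifying assumption, each latent variable is treated as an independent latent class model over its own word children; in an actual HLTM the latent variables are connected in a tree and the posterior would be computed by inference over the whole model. All probabilities and names below are hypothetical.

```python
from typing import Dict, List, Tuple

# For each latent variable: the prior P(Z=1) and, per word,
# the pair (P(word=1 | Z=0), P(word=1 | Z=1)). Values are made up.
Model = Tuple[float, Dict[str, Tuple[float, float]]]

MODELS: Dict[str, Model] = {
    "space": (0.2, {"nasa": (0.01, 0.6), "shuttle": (0.01, 0.5)}),
    "sports": (0.3, {"team": (0.05, 0.7), "game": (0.05, 0.6)}),
}


def posterior_on(doc: Dict[str, int], model: Model) -> float:
    """P(Z=1 | words) by Bayes' rule, assuming the word variables are
    conditionally independent given the binary latent variable Z."""
    prior_on, cond = model
    like_off, like_on = 1.0 - prior_on, prior_on
    for word, (p_off, p_on) in cond.items():
        present = doc.get(word, 0)
        like_off *= p_off if present else 1.0 - p_off
        like_on *= p_on if present else 1.0 - p_on
    return like_on / (like_off + like_on)


def topics_of(doc: Dict[str, int], models: Dict[str, Model],
              threshold: float = 0.5) -> List[str]:
    """Return every topic whose 'on' cluster the document falls into."""
    return sorted(name for name, m in models.items()
                  if posterior_on(doc, m) > threshold)


# A document mentioning both space and sports words, and an unrelated one.
mixed_doc = {"nasa": 1, "shuttle": 1, "team": 1, "game": 0}
other_doc = {"nasa": 0, "shuttle": 0, "team": 0, "game": 0}

mixed_topics = topics_of(mixed_doc, MODELS)
other_topics = topics_of(other_doc, MODELS)
```

Because each latent variable defines its own partition, the first document lands in the "on" cluster of both latent variables, which is how multidimensional clustering lets a single document belong to more than one topic.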
· Multidimensional clustering and latent tree models
· Latent tree models for topic detection
· Results on the New York Times dataset
· The HLTA Algorithm
· Comparisons with the LDA approach
· Analysis of IJCAI/AAAI papers (2000-2015)
LDA Approach (nHDP)
Part of Model
300,000 articles from New York Times (1987-2007)
AAAI/IJCAI papers (2000-2015)
· P. Chen, N.L. Zhang, et al. Latent Tree Models for Hierarchical Topic Detection. Artificial Intelligence, 250:105–124, 2017.
· P. Chen, N.L. Zhang, et al. Progressive EM for Latent Tree Models and Hierarchical Topic Detection. AAAI 2016.
· T. Liu, N.L. Zhang, P. Chen. Hierarchical Latent Tree Analysis for Topic Detection. ECML/PKDD (2) 2014: 256–272.
· R. Mourad, C. Sinoquet, N. L. Zhang, T.F. Liu and P. Leray (2013). A survey on latent tree models and applications. Journal of Artificial Intelligence Research, 47, 157–203.
· T. Liu, N.L. Zhang, et al. Greedy learning of latent tree models for multidimensional clustering. Machine Learning 98(1-2): 301–330 (2015).
· T. Chen, N. L. Zhang, T. F. Liu, Y. Wang, L. K. M. Poon (2012). Model-based multidimensional clustering of categorical data. Artificial Intelligence, 176(1), 2246–2269.
· Paisley, J., Wang, C., Blei, D. M., and Jordan, M. I. 2012. Nested hierarchical Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37.
· Blei, D. M., Griffiths, T. L., and Jordan, M. I. 2010. The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM, 57(2):7:1–7:30.
Part of the Model for IJCAI/AAAI Papers