Structured Sparsity for Pre-Training Distributed Word Representations with Subword Information

MPhil Thesis Defence


Title: "Structured Sparsity for Pre-Training Distributed Word 
Representations with Subword Information"

By

Mr. Leonard Elias LAUSEN


Abstract

Facilitating computational methods that can “understand” and work with 
humans requires putting the general world knowledge of humans at their 
disposal in a computationally suitable representation (Bengio, Courville, 
and Vincent 2013). Semantic memory refers to this human knowledge and the 
memory system storing it. Computational models thereof have been studied 
since the advent of computing (McRae and M. Jones 2013), typically based 
on text data (Yee, M. N. Jones, and McRae 2018) and a distributional 
hypothesis (Harris 1954; Firth 1957; Miller and Charles 1991), which 
postulates a relation between the co-occurrence distribution of sense 
inputs – such as words in language – and their respective semantic 
meaning.

Besides their use in validating and exploring psychological theories, 
word-based computational semantic models have gained popularity in natural 
language processing (NLP), as word representations obtained from large 
corpora help to improve performance on supervised NLP tasks for which only 
comparatively little labeled training data can be obtained (Turian, 
Ratinov, and Bengio 2010). Recently, a series of scalable methods 
beginning with Word2Vec (Tomas Mikolov, Chen, et al. 2013), commonly 
referred to as word embedding methods, have enabled learning word 
representations from very large unlabeled text corpora, yielding better 
representations as well as representations for more words. Unfortunately, 
the long-tail nature of human language – implying that most words are 
infrequent (Zipf 1949; Mandelbrot 1954) – prevents these methods from 
representing infrequent words well (Lowe 2001; Luong, Socher, and 
Christopher D. Manning 2013).

Considering that words are typically formed of meaningful parts, the 
distribution considered in the distributional hypothesis depends not only 
on atomic word-level information but also, to a large extent, on subword 
structure (Harris 1954). Taking morphological or subword information into 
account in computational models was therefore proposed as a remedy (Luong, 
Socher, and Christopher D. Manning 2013), and Bojanowski et al. (2017) 
recently proposed a scalable model incorporating subword-level 
information, termed fastText. fastText learns separate vectorial 
representations for words and their parts, specifically all character 
n-grams. The final word representation provided by the model is then the 
average of the word-level and n-gram-level representations.
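
As an illustrative sketch of this composition (the notation here is ours, 
not taken from the thesis): writing u_w for the word-level vector of a 
word w, G_w for its set of character n-grams and z_g for the vector of 
n-gram g, the averaged representation described above can be written as

	v_w = \frac{1}{1 + |G_w|} \Big( u_w + \sum_{g \in G_w} z_g \Big)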

In this thesis we propose an adaptation of the fastText model, motivated 
by the insight that estimating the word-level part of the representation, 
as well as the representations of some character n-grams, may be 
unreliable because it is based on only a few co-occurrence relations in 
the text corpus. We thus introduce a group lasso regularization (Yuan and 
Y. Lin 2006) to select a subset of word- and subword-level parameters for 
which good representations can be learned. For optimization we introduce a 
scalable ProxASGD optimizer based on insights into asynchronous proximal 
optimization by Pedregosa, Leblond, and Lacoste-Julien (2017). We evaluate 
the proposed method on a variety of tasks and find that the regularization 
improves performance for rare words and for morphologically complex 
languages such as German. By providing separate regularization for subword 
and word level information, the regularization hyperparameters further 
allow trading off performance on semantic and syntactic tasks.
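
To make the regularization idea concrete (again in our own, hypothetical 
notation, which may differ from the thesis): grouping the parameters into 
word vectors u_w and n-gram vectors z_g, a group lasso penalty of the form

	\lambda_{word} \sum_w \| u_w \|_2 \; + \; \lambda_{sub} \sum_g \| z_g \|_2

can drive entire groups exactly to zero, i.e. deselect words or n-grams 
whose representations cannot be estimated reliably; the two hyperparameters 
correspond to the separate word- and subword-level regularization mentioned 
above. Proximal methods handle such a penalty via the blockwise 
soft-thresholding operator

	\mathrm{prox}_{\eta\lambda \|\cdot\|_2}(v) = \max\Big( 0, \; 1 - \frac{\eta\lambda}{\|v\|_2} \Big) \, v

which is the kind of update an asynchronous proximal optimizer such as the 
ProxASGD optimizer mentioned above applies after each stochastic gradient 
step.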


Date:			Monday, 15 April 2019

Time:			3:00pm - 5:00pm

Venue:			Room 4621
 			Lifts 31/32

Committee Members:	Prof. Dit-Yan Yeung (Supervisor)
 			Prof. Nevin Zhang (Chairperson)
 			Dr. Yangqiu Song


**** ALL are Welcome ****