The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "IMPROVING SEMANTIC SMT FOR LOW RESOURCE LANGUAGES"

By

Miss Meriem BELOUCIF


Abstract

We have managed to consistently improve translation quality for 
challenging low resource languages by injecting semantics-based objective 
functions into the training pipeline at an early (training) rather than a 
late (tuning) stage, as in previous attempts. The set of approaches 
suggested in this thesis is motivated by the fact that including 
semantics in late stage tuning of machine translation models has already 
been shown to increase translation quality.

Any shortage of parallel data constitutes a serious obstacle for 
conventional machine translation training techniques, because of their 
heavy dependency on memorization from big data. With low resource 
languages, for which parallel corpora are scarce, it becomes imperative to 
make learning from small data much more efficient by adding constraints 
that create stronger inductive biases---especially 
linguistically well-motivated constraints, such as the shallow semantic 
parses of the training sentences. However, while automatic semantic 
parsing is readily available to produce shallow semantic parses for a high 
resource output language (typically English), there are no semantic 
parsers for low resource input languages, such as those in the Uyghur 
and Uzbek translation challenges.

We propose the first methods that inject a crosslingual semantics-based 
objective function into the training of translation models for translation 
tasks like Chinese--English, where semantic parsers exist for both 
languages. We report promising results showing that training the machine 
translation model in this way, in general, helps bias the learning towards 
semantically more correct bilingual constituents. Semantic statistical 
machine translation for low resource languages has been a difficult 
challenge, since semantic parses are usually available not for low 
resource input languages but only for high resource output languages such 
as English. We therefore extend our bilingual approaches to the low 
resource setting via new training approaches that require only the output 
language semantic parse.

We then thoroughly analyze the reasons behind the promising results that 
we achieved on multiple challenging low resource translation tasks, 
including Hausa, Uzbek, Uyghur, Swahili, Oromo and Amharic, always 
translating into English. Our methods rely heavily on the quality of the 
semantic parser, which completely fails to parse any sentence containing 
any form of the verb "to be". Discarding such sentences means discarding 
a large portion of the data. Finally, we propose a novel way to 
semantically parse sentences that contain the verb "to be", and re-run 
all previous models on this newly parsed data. We show further 
translation improvements through this new approach for many low resource 
languages.


Date:			Monday, 27 March 2018

Time:			10:30 - 13:00

Venue:			CYT G003
 			Lifts 35/36

Chairman:		Prof. Inchi Hui (ISOM)

Committee Members:	Prof. Dekai Wu (Supervisor)
 			Prof. Fangzhen Lin
 			Prof. Xiaojuan Ma
 			Prof. Min Zhang (HUMA)
 			Prof. Martha Palmer (Univ. of Colorado)


**** ALL are Welcome ****