Practical Improvements to Automatic Visual Speech Recognition

MPhil Thesis Defence


Title: "Practical Improvements to Automatic Visual Speech Recognition"

By

Mr. Ho Long FUNG


Abstract

Visual speech recognition (also known as lipreading) is the task of 
recognizing speech solely from the visual movements of the mouth. In this 
work, we propose several feasible and practical strategies and demonstrate 
significant improvements over established competitive baselines in both 
low-resource and resource-sufficient scenarios.

On the one hand, a main challenge in practical automatic lipreading is 
handling the diverse facial viewpoints present in the available video data. 
The recently proposed spatial transformer enhances the spatial invariance 
of convolutional neural networks to their input, and has achieved varying 
degrees of success in a broad spectrum of areas, including face 
recognition, facial alignment and gesture recognition, by virtue of the 
increased model robustness to viewpoint variations in the data. We study 
the effectiveness of the learned spatial transformation in our model 
through quantitative and qualitative analysis with visualizations, and 
attain an absolute accuracy gain of 0.92% over our data-augmented baseline 
on the resource-sufficient Lip Reading in the Wild (LRW) continuous word 
recognition task by incorporating a spatial transformer.
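
To illustrate how a spatial transformer can be incorporated, the following 
is a minimal PyTorch-style sketch (not the thesis implementation): a 
localization network predicts an affine warp for each mouth-region frame, 
and the warped frame is then passed to the recognition CNN. The layer sizes 
and the 64x64 input resolution are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpatialTransformer(nn.Module):
        # Learns an affine warp per frame: localization net -> affine grid -> sampler.
        def __init__(self, in_channels=1):
            super().__init__()
            self.localization = nn.Sequential(
                nn.Conv2d(in_channels, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(True),
                nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(True),
            )
            self.fc_loc = nn.Sequential(
                nn.Linear(10 * 12 * 12, 32), nn.ReLU(True),
                nn.Linear(32, 6),
            )
            # Start from the identity transform so early training applies "no warp".
            self.fc_loc[-1].weight.data.zero_()
            self.fc_loc[-1].bias.data.copy_(
                torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

        def forward(self, x):  # x: (batch, channels, 64, 64) mouth-region frames
            theta = self.fc_loc(self.localization(x).flatten(1)).view(-1, 2, 3)
            grid = F.affine_grid(theta, x.size(), align_corners=False)
            return F.grid_sample(x, grid, align_corners=False)

    # The warped frames then enter the usual lipreading front-end, e.g.
    # (recognition_cnn is a hypothetical placeholder):
    # features = recognition_cnn(SpatialTransformer()(frames))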

On the other hand, we explore the effectiveness of the convolutional neural 
network (CNN) and the long short-term memory (LSTM) recurrent neural 
network for lipreading in a low-resource scenario that has not been 
explored before. We propose an end-to-end deep learning model that fuses a 
conventional CNN and a bidirectional LSTM (BLSTM) together with maxout 
activation units (maxout-CNN-BLSTM) and dropout. It attains a word accuracy 
of 87.6% on the low-resource OuluVS2 corpus, an absolute improvement of 
3.1% over the previous state-of-the-art autoencoder-BLSTM model at that 
time. We emphasize that our lipreading system requires neither a separate 
feature extraction stage nor a pre-training phase with external data 
resources.
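
The maxout-CNN-BLSTM idea can likewise be sketched as follows (a minimal 
illustration under assumed layer sizes, a 32x32 mouth-ROI resolution and a 
10-class vocabulary; not the exact thesis architecture): per-frame CNN 
features with maxout activations and dropout are fed to a BLSTM, and the 
utterance-level output is classified into words.

    import torch
    import torch.nn as nn

    class Maxout2d(nn.Module):
        # Maxout over k parallel conv feature maps: out = max_i conv_i(x).
        def __init__(self, in_ch, out_ch, k=2, **conv_kwargs):
            super().__init__()
            self.k, self.out_ch = k, out_ch
            self.conv = nn.Conv2d(in_ch, out_ch * k, **conv_kwargs)

        def forward(self, x):
            y = self.conv(x)
            b, _, h, w = y.shape
            return y.view(b, self.out_ch, self.k, h, w).max(dim=2).values

    class MaxoutCNNBLSTM(nn.Module):
        def __init__(self, n_classes=10, hidden=256):
            super().__init__()
            self.cnn = nn.Sequential(
                Maxout2d(1, 32, kernel_size=3, padding=1), nn.MaxPool2d(2),
                Maxout2d(32, 64, kernel_size=3, padding=1), nn.MaxPool2d(2),
                nn.Dropout(0.5),
            )
            self.blstm = nn.LSTM(64 * 8 * 8, hidden,
                                 batch_first=True, bidirectional=True)
            self.fc = nn.Linear(2 * hidden, n_classes)

        def forward(self, x):  # x: (batch, time, 1, 32, 32) mouth-ROI frames
            b, t = x.shape[:2]
            f = self.cnn(x.flatten(0, 1)).flatten(1).view(b, t, -1)
            out, _ = self.blstm(f)            # (batch, time, 2*hidden)
            return self.fc(out.mean(dim=1))   # utterance-level word logits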


Date:			Wednesday, 12 December 2018

Time:			2:30pm - 4:30pm

Venue:			Room 2131C
 			Lift 19

Committee Members:	Dr. Brian Mak (Supervisor)
 			Prof. Dit-Yan Yeung (Chairperson)
 			Dr. Raymond Wong


**** ALL are Welcome ****