Vision-Based Sign Language Processing: Recognition, Translation, and Generation

PhD Thesis Proposal Defence


Title: "Vision-Based Sign Language Processing: Recognition, Translation, and
Generation"

by

Mr. Ronglai ZUO


Abstract:

Sign languages, also known as signed languages, are the primary means of
communication among deaf and hard-of-hearing people, using both manual and
non-manual parameters to convey information. These visual languages have
unique grammatical rules and vocabularies that usually differ from those of
their spoken-language counterparts, resulting in a two-way communication gap
between deaf and hearing people. In this thesis proposal, we will elaborate on
our work in three areas of sign language processing (SLP): recognition,
translation, and generation, with the aim of narrowing this communication gap.

We first focus on the design of the sign encoder. Previous sign encoders are
mostly single-modality, operating on RGB videos and therefore suffering from
substantial visual redundancy such as background and signer appearance. To
assist sign language modeling, we adopt keypoints, which are more robust to
visual redundancy and highlight critical body parts, e.g., the hands, as an
additional modality in our sign encoder. By representing keypoints as a
sequence of heatmaps, estimation noise can be reduced, and the network
architecture for keypoint modeling can remain consistent with that for video
modeling without any ad-hoc design. The resulting sign encoder, the
video-keypoint network (VKNet), has a two-stream architecture in which videos
and keypoints are processed by separate streams and information is exchanged
through inter-stream connections.
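
As a concrete illustration, the sketch below (not the thesis implementation;
the function name, shapes, and Gaussian width are assumptions) shows how
estimated keypoints can be rendered as per-frame Gaussian heatmaps, so that
the keypoint stream can reuse the same video-style architecture as the RGB
stream.

# A minimal sketch of rendering estimated keypoints as Gaussian heatmaps.
import numpy as np

def keypoints_to_heatmaps(keypoints, height=64, width=64, sigma=2.0):
    """keypoints: (T, K, 2) array of (x, y) in [0, 1] normalized coordinates.
    Returns a (T, K, height, width) array of Gaussian heatmaps."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((keypoints.shape[0], keypoints.shape[1], height, width),
                        dtype=np.float32)
    for t, frame in enumerate(keypoints):
        for k, (x, y) in enumerate(frame):
            cx, cy = x * (width - 1), y * (height - 1)
            # Gaussian blob centered at the keypoint; soft spatial targets
            # absorb estimation noise better than raw coordinates.
            heatmaps[t, k] = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2)
                                    / (2 * sigma ** 2))
    return heatmaps

# Example: 32 frames, 42 body and hand keypoints.
hm = keypoints_to_heatmaps(np.random.rand(32, 42, 2))
print(hm.shape)  # (32, 42, 64, 64) -- consumed by the keypoint stream like a video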

VKNet is first applied to continuous sign language recognition (CSLR), the core
task in SLP. Training such a large network is non-trivial because of data
scarcity. Besides using the widely adopted connectionist temporal
classification (CTC) loss as the main objective, we propose a series of
techniques, including sign pyramid networks with auxiliary supervision and
self-distillation, to ease training. The overall model is referred to as
VKNet-CSLR. Going a step further, we extend it to support sign language
translation (SLT) by appending a translation network.
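
A hedged sketch of such a training objective is given below: the main CTC
loss is combined with auxiliary CTC supervision on intermediate (pyramid)
outputs and a self-distillation term that aligns auxiliary predictions with
the main head. The loss weights and the assumption that auxiliary outputs
share the main temporal resolution are illustrative, not the thesis
configuration.

# A sketch of CTC training with auxiliary supervision and self-distillation.
import torch
import torch.nn.functional as F

def cslr_loss(main_logits, aux_logits_list, targets,
              input_lens, target_lens, aux_weight=0.5, distill_weight=1.0):
    """main_logits and each aux_logits: (T, N, C) unnormalized scores."""
    log_probs = F.log_softmax(main_logits, dim=-1)
    loss = F.ctc_loss(log_probs, targets, input_lens, target_lens)
    teacher = log_probs.detach()  # frame-level teacher for self-distillation
    for aux_logits in aux_logits_list:
        aux_log_probs = F.log_softmax(aux_logits, dim=-1)
        # Auxiliary CTC supervision on intermediate (pyramid) features.
        loss = loss + aux_weight * F.ctc_loss(aux_log_probs, targets,
                                              input_lens, target_lens)
        # Self-distillation: pull auxiliary distributions toward the main head.
        loss = loss + distill_weight * F.kl_div(aux_log_probs, teacher,
                                                log_target=True,
                                                reduction='batchmean')
    return loss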

We then move to the fundamental task in SLP: isolated sign language recognition
(ISLR). To improve model robustness against the large variation in sign
duration, we extend our VKNet to take video-keypoint pairs with varied temporal
receptive fields as inputs. In addition, we identify the existence of visually
indistinguishable signs and propose two techniques based on natural language
priors, language-aware label smoothing and inter-modality mixup, to assist
model training.
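
For illustration, the sketch below gives one possible form of language-aware
label smoothing: instead of spreading the smoothing mass uniformly over all
classes, it is distributed according to the similarity between gloss
embeddings, so semantically related signs receive more of it. The gloss_emb
input, epsilon, and temperature are hypothetical choices, not the thesis
settings.

# A sketch of label smoothing guided by gloss-embedding similarity.
import torch
import torch.nn.functional as F

def language_aware_targets(labels, gloss_emb, epsilon=0.2, temperature=0.1):
    """labels: (N,) class indices; gloss_emb: (C, D) gloss word embeddings."""
    emb = F.normalize(gloss_emb, dim=-1)
    sim = emb @ emb.t()                      # (C, C) cosine similarities
    sim.fill_diagonal_(float('-inf'))        # exclude the true class itself
    soft = F.softmax(sim / temperature, dim=-1)
    targets = (1 - epsilon) * F.one_hot(labels, emb.size(0)).float()
    return targets + epsilon * soft[labels]  # use with soft-target cross-entropy

# Usage with a soft-target cross-entropy loss.
logits = torch.randn(4, 100)
labels = torch.randint(0, 100, (4,))
targets = language_aware_targets(labels, torch.randn(100, 300))
loss = -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()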

In the last work, we develop a framework for online CSLR and SLT. In contrast
to previous CSLR works that perform training and inference over entire
untrimmed sign videos (offline CSLR), our framework trains an ISLR model on
short sign clips and makes predictions in a sliding-window manner. The
framework can further be extended to boost offline CSLR performance and to
support online SLT with additional lightweight networks.
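
A minimal sketch of the sliding-window inference loop, under assumed window
and stride sizes and a hypothetical islr_model, is given below; per-window
predictions would still need post-processing (e.g., merging repeated glosses)
to form the final gloss sequence.

# A sketch of sliding-window online recognition with a clip-level ISLR model.
import torch

def online_recognize(frame_stream, islr_model, window=16, stride=8):
    """frame_stream: iterable of (C, H, W) frames; yields gloss predictions."""
    buffer = []
    for frame in frame_stream:
        buffer.append(frame)
        if len(buffer) == window:
            clip = torch.stack(buffer).unsqueeze(0)   # (1, T, C, H, W)
            with torch.no_grad():
                probs = islr_model(clip).softmax(-1)  # (1, num_glosses)
            yield probs.argmax(-1).item()             # emit prediction online
            buffer = buffer[stride:]                  # slide the window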

The recognition and translation tasks convert sign videos into textual
representations (glosses or text). As the reverse process, sign language
generation (SLG) systems translate spoken languages into sign languages,
completing the two-way communication loop. We will finally introduce our plan
for building an SLG baseline with 3D avatars.


Date:                   Friday, 23 February 2024

Time:                   1:00pm - 3:00pm

Venue:                  Room 5501
                        Lifts 25/26

Committee Members:      Dr. Brian Mak (Supervisor)
                        Dr. Yangqiu Song (Chairperson)
                        Prof. Raymond Wong
                        Dr. Dan Xu