The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "A deep discriminative representational framework for recovering 
categorical 3D object attributes from visual data"

By

Mr. Shichao LI


Abstract

Recognizing 3D properties from 2D RGB images is a fundamental problem in 
computer vision that enables numerous applications such as human-computer 
interaction, traffic surveillance, autonomous perception, and augmented 
reality. The problem is challenging due to the loss of depth information in 
the image formation process and large variations in depth, object geometry, 
and scene illumination. In early studies, David Marr proposed a 
representational framework of vision that begins with a low-level primal 
sketch, progresses to an intermediate 2.5D sketch, and ends with a 3D model 
representation. With the increasing availability of 3D labels, recent deep 
representation learning approaches instantiate such a framework by learning 
from data in an end-to-end manner. This thesis explores such instantiations for 
a set of representative 3D perception problems. After introducing the 
background and the positioning of this thesis, the studies are presented in 
order of increasing numbers of 3D attributes, camera views, and system 
capabilities.

We first study the problem of recognizing the 3D orientation of vehicles from a 
single RGB image. In contrast to prior work that directly regresses angular 
values with a deep neural network, we propose a progressive approach that 
learns geometry-aware representations from perspective points and achieves 
improved model generalization. We further encode prior knowledge of a 
projective invariant into the training process to improve representation 
learning with extra unlabeled images.

Second, we study inferring the non-rigid 3D posture of humans from single-view 
images. We identify a dataset bias problem in the training phase and propose 
the first method to incorporate synthetic data into the training of 
2D-to-3D networks, achieving better model generalization to unseen inputs.

We then extend the study of rigid pose estimation to two views and learn 
voxel-based representations for stereo 3D object detection. We propose a new 
multi-resolution approach that enables high-resolution modeling of object 
regions, and design a new instance-level model that achieves high-precision, 
transferable pose refinement.

Finally, we extend the capability of the perception model beyond rigid pose 
estimation to fine-grained shape inference, bringing it closer to the binocular 
human vision system. We design the first model for joint stereo 3D object 
detection and implicit shape estimation, with a new instance-level model that 
infers shape from intermediate point-based representations. We further extend 
the pose refinement studies to non-rigid object classes such as pedestrians 
and cyclists.


Date:			Tuesday, 16 August 2022

Time:			2:00pm - 3:40pm

Zoom Meeting: 		https://hkust.zoom.us/j/9838391022

Chairperson:		Prof. Weiyin HONG (ISOM)

Committee Members:	Prof. Tim CHENG (Supervisor)
 			Prof. Qifeng CHEN
 			Prof. Dan XU
 			Prof. Weichuan YU (ECE)
 			Prof. Hongsheng LI (CUHK)


**** ALL are Welcome ****