The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Efficient and Accurate Data Association in Large-Scale
Structure-from-Motion and Beyond"

By

Mr. Tianwei SHEN


Abstract

Data association, in the context of Structure-from-Motion (SfM) and 
Simultaneous Localization and Mapping (SLAM), is the process of 
associating uncertain measurements (e.g. image pixels, local descriptors 
and 3D tracks) with the same object or identity. It forms the foundation of 
many 3D computer vision problems, ranging from finding local feature 
correspondences and identifying overlapping images to bundle adjustment and 
related graph-based optimization problems that seek a consistent state in 
terms of geometric and photometric quantities. Unlike deterministic pose 
estimation algorithms that typically 
have closed-form solutions, data association usually works in a noisy 
setting and does not possess an analytical form. Yet, it greatly affects 
the efficiency and accuracy of the reconstruction. In this thesis, we 
explore the elements of the data association problem in the context of 3D 
reconstruction and related problems. More specifically, we first give a 
thorough overview of the modern SfM pipeline, with a focus on the 
functionality of data association in each of its sub-steps. Then we 
describe three novel methods that solve data association in SfM-related 
3D computer vision problems.

First, we propose a learning-based algorithm for the efficient and 
accurate association of similar images that depict the same scene, which 
often serves as the first step in a large-scale 3D reconstruction to 
accelerate the later image matching pipeline. Though Convolutional Neural 
Networks (CNNs) have achieved superior performance on object image 
retrieval, Bag-of-Words (BoW) models with handcrafted local features still 
dominate the retrieval of overlapping images in 3D reconstruction. We 
narrow this gap by presenting an efficient CNN-based method to 
retrieve images with overlaps, which we refer to as the matchable image 
retrieval problem. We propose a batched triplet-based loss function 
combined with mesh re-projection to effectively learn the CNN 
representation. The proposed method significantly accelerates the image 
retrieval process in 3D reconstruction and outperforms the 
state-of-the-art CNN-based and BoW methods for matchable image retrieval.
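
As a rough, illustrative sketch of the kind of objective involved (not the 
exact formulation used in the thesis), a batched triplet margin loss over 
L2-normalized CNN image descriptors could look like the following; the 
tensor layout, the margin value and the function name are assumptions.

    import torch.nn.functional as F

    def batched_triplet_loss(anchor, positive, negative, margin=0.5):
        """Triplet margin loss over L2-normalized image descriptors.

        anchor, positive, negative: (B, D) batches of CNN embeddings, where
        each positive overlaps the anchor view and each negative does not.
        """
        anchor = F.normalize(anchor, dim=1)      # unit-length descriptors
        positive = F.normalize(positive, dim=1)
        negative = F.normalize(negative, dim=1)

        d_pos = (anchor - positive).pow(2).sum(dim=1)  # squared distance to positive
        d_neg = (anchor - negative).pow(2).sum(dim=1)  # squared distance to negative

        # Hinge: positives should be at least `margin` closer than negatives
        return F.relu(d_pos - d_neg + margin).mean()

In the matchable retrieval setting, positives would be images whose mesh 
re-projections overlap the anchor view, and negatives would be 
non-overlapping images.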

Building on pairwise image matching, we present a match graph construction 
method that tackles the issues of completeness, efficiency and consistency 
in a unified framework. Pairwise image matching of unordered image 
collections greatly affects the efficiency and accuracy of SfM. 
Insufficient match pairs may result in disconnected structures or 
incomplete components, while redundant pairs, which are costly to match and 
may contain erroneous ones, can lead to folded and superimposed structures. 
Our approach starts by chaining all non-singleton images using a 
visual-similarity-based 
minimum spanning tree. Then the minimum spanning tree is incrementally 
expanded to form locally consistent strong triplets. Finally, a global 
community-based graph algorithm is introduced to strengthen the global 
consistency by reinforcing potentially large connected components. We 
demonstrate the superior performance of our method in terms of accuracy 
and efficiency on both benchmark and Internet datasets. Our method also 
performs remarkably well on the challenging datasets of highly ambiguous 
and duplicated scenes.
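
As an illustration of the first stage only, the sketch below builds a 
minimum spanning tree over a pairwise visual-similarity matrix to obtain a 
connected backbone of match pairs; the dense similarity matrix `sim` and 
the SciPy-based formulation are assumptions, not the thesis implementation.

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree

    def similarity_mst_pairs(sim):
        """Seed match pairs from an (N, N) visual-similarity matrix in [0, 1].

        Returns a list of (i, j) image pairs forming a minimum spanning tree
        over the dissimilarity graph, i.e. a connected set of match pairs.
        """
        dissim = np.clip(1.0 - sim, 1e-6, None)  # keep every edge positive
        np.fill_diagonal(dissim, 0.0)            # zero weight = no self-edge
        mst = minimum_spanning_tree(dissim)      # sparse (N, N) result
        rows, cols = mst.nonzero()
        return list(zip(rows.tolist(), cols.tolist()))

The triplet expansion and community-based reinforcement stages would then 
add further pairs on top of this backbone.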

The data association problem also arises widely in other domains of 3D 
reconstruction. We describe our contributions in two related problems, 
namely generating consistent textures in image-based modeling, and 
estimating relative camera poses via the deep interplay of photometric and 
geometric information. The first shares the same graph structure as the 
large-scale SfM problem, while the second combines traditional geometric 
motion estimation with recent learning-based methods. We bridge the gap 
between geometric loss and photometric loss by introducing a matching loss 
constrained by epipolar geometry in a 
self-supervised framework. Evaluated on the KITTI dataset, our method 
outperforms the state-of-the-art unsupervised ego-motion estimation 
methods by a large margin. We conclude the thesis by laying out future 
directions of data association with different types of information 
sources.
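
For the epipolar-geometry-constrained matching loss mentioned above, a 
generic sketch based on the Sampson distance is given below; the exact loss 
used in the thesis may differ, and the function and variable names are 
illustrative only.

    import torch

    def epipolar_matching_loss(F_mat, pts1, pts2, eps=1e-8):
        """Mean Sampson distance of matches w.r.t. a fundamental matrix.

        F_mat: (3, 3) fundamental matrix between two views.
        pts1, pts2: (N, 2) matched pixel coordinates in the two images.
        Small values mean the matches agree with the estimated relative pose.
        """
        ones = torch.ones_like(pts1[:, :1])
        x1 = torch.cat([pts1, ones], dim=1)  # homogeneous coordinates, (N, 3)
        x2 = torch.cat([pts2, ones], dim=1)

        Fx1 = x1 @ F_mat.T   # epipolar lines in image 2 for points in image 1
        Ftx2 = x2 @ F_mat    # epipolar lines in image 1 for points in image 2
        x2Fx1 = (x2 * Fx1).sum(dim=1)  # algebraic residual x2^T F x1

        denom = Fx1[:, 0] ** 2 + Fx1[:, 1] ** 2 + Ftx2[:, 0] ** 2 + Ftx2[:, 1] ** 2
        return (x2Fx1 ** 2 / (denom + eps)).mean()

Minimizing such a loss ties photometric matches to the geometric relative 
pose, in the spirit of the interplay described above.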


Date:			Wednesday, 5 June 2019

Time:			2:00pm - 4:00pm

Venue:			Room 3494
 			Lifts 25/26

Chairman:		Prof. Danny Tsang (ECE)

Committee Members:	Prof. Long Quan (Supervisor)
 			Prof. Pedro Sander
 			Prof. Chiew-Lan Tai
 			Prof. Chi-Keung Tang
 			Prof. Ajay Joneja (ISD)
 			Prof. Hongdong Li (Australian National Univ)

**** ALL are Welcome ****