Theory and Practice: Similarity Measurements on Large-Scale Graphs

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Theory and Practice: Similarity Measurements on Large-Scale Graphs"

By

Miss Dandan LIN


Abstract:

Measuring node-to-node similarity in a graph is a fundamental tool for 
various graph-based mining applications such as image/information 
retrieval, friend/item recommendation, link prediction, text mining and 
community detection. Thus, studying similarity measures has been 
recognized as an important research problem in the data mining community. 
In this thesis, we focus on three representative similarity measurements 
in the literature: Random Walk with Restart (RWR), Manifold Ranking (MR) 
and First Hitting Probability (FHP), all of which provide good metrics for 
measuring the proximity of two nodes in the graph by considering the local 
structure and the global structure in the graph. Specifically, given a 
graph G and a pair of nodes s, t in G, RWR (resp. MR/FHP) computes a 
"similarity score" of t with respect to (w.r.t) s, which is called the 
RWR (resp. MR/FHP) score. The larger the RWR (resp. MR/FHP) score, the 
more similar to s node t is. However, for these measurements, existing 
best-known algorithms for computing RWR/MR/FHP scores have one of the 
following issues: (1) some of them require to build a bulky index, (2) 
some of them do not have any theoretical bound on the output, and (3) some 
of them are very time-consuming, making both of them difficult to be used 
in real world applications where the graphs are always in a large scale 
with millions of nodes and billions of edges.

In this thesis, we firstly propose an index-free algorithm called 
Residue-Accumulated approach (ResAcc) which returns RWR scores with a 
theoretical guarantee efficiently. Our experimental evaluations on 
large-scale real graphs show that ResAcc is up to 4 times faster than the 
best-known previous algorithm, guaranteeing the same accuracy. Under 
typical settings, the best-known algorithm ran around 1000 seconds on a 
large dataset containing 41.7 million nodes, which is too time-consuming, 
while ResAcc finished in 275 seconds with the same accuracy. Moreover, 
ResAcc is up to 6 orders of magnitude more accurate than the best-known 
algorithm in practice with the same execution time, which is considered as 
a substantial improvement.

Secondly, we study the top-k image retrieval (which returns the k images 
from the image datasets which are the most similar to a given query image) 
with manifold ranking (MR), which could give more accurate results than 
existing deep learning-based approaches. More importantly, compared with 
RWR, MR achieves higher precision on the top-k results since MR treats 
each edge in the graph with different weights while RWR treats equally. 
Besides, we propose 2 algorithms, namely Monte Carlo-based MR (MCMR) and 
MCMR+, for top-k image retrieval. We are the first one to propose an 
index-free manifold ranking image retrieval with the output theoretical 
bound. More importantly, our algorithms give the first best-known time 
complexity result of O(n log n) where n is the total number of images in 
the database compared with the existing best-known result of O(n^2) in the 
literature of computing the exact top-k results with quality guarantee. 
Lastly, our experimental results show that MCMR+ outperforms existing 
algorithms by up to 4 orders of magnitude in terms of query time.

Thirdly, we focus on the group FHP value where the target is a set of 
nodes, instead of a single node, which is applicable to more real-world 
scenarios. Specifically, given a source node s and a target set T, the 
group FHP value f(s, T) is defined as the probability that a random walk 
from s hits one of the nodes in T for the first time. Based on this 
general FHP version, we present the efficient index-free algorithms on two 
types of FHP queries: the pairwise query which returns the FHP value of a 
target set T w.r.t a source node s, and the top-k query which returns the 
k target sets with the largest FHP values w.r.t a source node s. To answer 
the pairwise query, we first present a non-trivial and rigorous derivation 
of a group local push algorithm tailored for FHP (which cannot be found in 
the literature), and then speed up the computations by combining this 
group local push algorithm with the Monte Carlo approach. For top-k 
queries, we present a novel pruning strategy which prunes most non-top-k 
nodes without explicitly computing the lower/upper bounds. By considering 
the properties of FHP, we further present optimizations for the top-k 
queries to improve the practical query efficiency. Extensive experiments 
demonstrate that our proposed solutions run faster than the 
state-of-the-arts by up to 4 orders of magnitude in terms of query time.


Date:			Friday, 15 January 2021

Time:			10:00am - 12:00noon

Zoom Meeting: 
https://hkust.zoom.us/j/94894593324?pwd=QmFxRXd5MUMwV2Y1eEVYbm5ad05KZz09

Chairperson:		Prof. Iam Keong SOU (PHYS)

Committee Members:	Prof. Raymond WONG (Supervisor)
 			Prof. Wilfred NG
 			Prof. Dit Yan YEUNG
 			Prof. Xueqing ZHANG (CIVL)
 			Prof. Hanghang TONG (University of Illinois)


**** ALL are Welcome ****