Towards Good Utilisation of Crowd (Stack Overflow) Wisdom

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Towards Good Utilisation of Crowd (Stack Overflow) Wisdom"

By

Mr. Fuxiang CHEN


Abstract

Stack Overflow (SO), established in 2008, is a community 
question-answering forum tailored to developers. It is widely and 
actively used by developers. Today, there are more than 40 million 
questions and answers residing in SO, and this number is expected to grow 
over time. Despite the huge amount of invaluable information residing in 
SO, taking full advantage of it has been challenging, mainly due to the 
interleaving of unstructured natural language text and code snippets 
embedded in each post. To utilise this crowd wisdom effectively, this 
thesis proposes three novel works that leverage SO to help developers 
improve their productivity.

In the first work, we propose mining SO to help developers to debug their 
code. Our approach finds defective code fragments in developers’ software 
projects by detecting code clones between the code snippets in SO 
questions and the project code, and then processes these clones to 
triangulate the source code anomalies inside the projects. Our approach 
reveals 189 warnings, of which 171 (90.5%) are confirmed by developers 
from eight high-quality and well-maintained projects. We also compared 
the confirmed bugs against three popular static analysis tools (FindBugs, 
JLint and PMD). Of the 171 bugs identified by our approach, FindBugs 
detected only six, whereas JLint and PMD detected none.
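
As a rough illustration of the clone-detection step only, the following 
Python sketch compares an SO snippet with project code by normalising 
identifiers and numbers and measuring the overlap of token 4-grams. The 
abstract does not specify the clone detector used in the thesis; the 
tokeniser, the keyword list, the function names and the Jaccard measure 
are all assumptions made for this example.

```python
import re

# Hypothetical keyword list; a real tool would use the full language grammar.
KEYWORDS = {"for", "while", "if", "else", "return", "int", "new", "null"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
IDENT_RE = re.compile(r"[A-Za-z_]\w*")


def normalise(code):
    """Tokenise code, replacing identifiers with ID and numbers with NUM."""
    tokens = []
    for tok in TOKEN_RE.findall(code):
        if tok in KEYWORDS:
            tokens.append(tok)
        elif IDENT_RE.fullmatch(tok):
            tokens.append("ID")
        elif tok.isdigit():
            tokens.append("NUM")
        else:
            tokens.append(tok)  # punctuation and operators, one char each
    return tokens


def shingles(tokens, n=4):
    """Return the set of consecutive n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def clone_similarity(a, b, n=4):
    """Jaccard similarity of the normalised token n-grams of two fragments."""
    sa, sb = shingles(normalise(a), n), shingles(normalise(b), n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Illustrative inputs: the same off-by-one loop with renamed variables.
so_snippet = "for (int i = 0; i <= arr.length; i++) { sum += arr[i]; }"
project_code = "for (int k = 0; k <= data.length; k++) { total += data[k]; }"
score = clone_similarity(so_snippet, project_code)
```

Because renamed variables normalise to the same tokens, the two loops 
above are treated as clones, which is the property a detector needs to 
match SO question snippets against differently named project code.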

In the second work, we propose highlighting problem-cause and solution 
summary sentences in answer posts to guide developers in reading the 
answers. A recent survey revealed that the majority of non-native 
English-speaking developers have trouble understanding English text and 
source code, because programming languages are English-based, and that 
they prefer more visual aids on QA sites such as SO to help them 
understand the content more easily. Separately, it has also been 
reported that the irrelevance and 
redundancy of SO answers may inhibit developers’ ability to retrieve 
information from SO efficiently. We also observed that in many of the SO 
answers, a single sentence can represent the high-level description of the 
problem-cause or solution of the question asked. We thus propose 
highlighting both problem-cause and solution summary sentences in the SO 
answer posts to guide developers in their reading. Our technique comprises 
ensemble models of extractive summarization techniques that detect 
salient sentences. Our approach consistently outperforms other extractive 
summarization methods, including the state of the art, with relative 
improvements of between 13.41% and 40.91% for problem-cause summarization 
and between 4.12% and 40.28% for solution summarization. In a user study, 
most developers reported that the extracted summaries are accurate and 
help them to read the answers better.
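
Relative improvement is conventionally the gain over a baseline expressed 
as a percentage of the baseline score. Assuming that convention applies 
here (the abstract does not name the underlying evaluation metric), the 
calculation can be sketched as follows. The function name and the scores 
are hypothetical and are not taken from the thesis.

```python
def relative_improvement(ours: float, baseline: float) -> float:
    """Percentage gain of `ours` over `baseline`, relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (ours - baseline) / baseline * 100.0


# Hypothetical scores for illustration only: 0.50 for a baseline method
# and 0.60 for the proposed one give a gain of roughly 20 percent.
gain = relative_improvement(ours=0.60, baseline=0.50)
```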

In the third work, we propose generating SQL statements automatically from 
natural language. Using natural language to program has been a 
long-cherished dream. Existing works on generating SQL queries from 
natural language are conditioned on either a given table schema or a 
relational database. We analyzed real-world developers’ data management 
issues in SO and found that these scenarios account for only a tiny 
portion of the many problems developers face. In this work, we propose an 
end-to-end, general-purpose natural language to SQL (NL2SQL) statement 
generation method using an SO dataset. Our method also incorporates a 
denoising module that can be applied to correct SQL syntax errors in the 
generated SQL queries regardless of the NL2SQL generation model used. 
Experiments show that the proposed NL2SQL method yields more 
syntactically correct queries (up to 43% more using a Seq2Seq model) in 
most cases.
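
To make "syntactically correct" concrete, the sketch below shows one way 
a generated query could be checked for syntax errors without any schema. 
It is not the denoising module itself, and the abstract does not say 
which SQL dialect or checker the thesis uses: SQLite (through Python's 
standard sqlite3 module) and the function name are assumptions made for 
this example. The check is a heuristic that relies on SQLite's error 
messages to separate parse failures from missing tables.

```python
import sqlite3


def is_syntactically_valid(query: str) -> bool:
    """Heuristically check whether SQLite can parse `query`.

    The query is prepared against an empty in-memory database. Errors such
    as "no such table" mean parsing succeeded, so only messages that report
    a parse failure are counted as syntax errors.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("EXPLAIN " + query)  # prepares the query, returns a plan
        return True
    except sqlite3.OperationalError as exc:
        message = str(exc)
        return not ("syntax error" in message or "incomplete input" in message)
    finally:
        conn.close()


# Hypothetical generated queries for illustration.
well_formed = is_syntactically_valid("SELECT name FROM users WHERE age > 30")
malformed = is_syntactically_valid("SELEC name FROM users")
```

A checker of this kind works with any generation model, since it only 
inspects the output query string.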


Date:			Tuesday, 7 August 2018

Time:			10:30am - 12:30pm

Venue:			Room 5560
 			Lifts 27/28

Chairman:		Prof. Xinghua Zheng (ISOM)

Committee Members:	Prof. Sunghun Kim (Supervisor)
 			Prof. Andrew Horner
 			Prof. Frederick Lochovsky
 			Prof. Eric Nelson (HUMA)
 			Prof. Doo-Hwan Bae (KAIST)


**** ALL are Welcome ****