Automatic Techniques for Code Example Generation

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering


PhD Thesis Defence


Title: "Automatic Techniques for Code Example Generation"

By

Mr. Xiaodong Gu


Abstract

Developers often wonder how to implement a particular program functionality. Code
examples are very helpful in this regard. Over the years, many approaches
have been proposed to generate code examples. Existing approaches
often treat queries and source code as textual documents and utilize
information retrieval models to retrieve relevant code snippets that match
a given query.

However, conventional code example generation approaches face the
following major challenges. First, they rely on a bag-of-words assumption
and cannot recognize high-level features of queries and source code.
Second, source code and natural language queries are heterogeneous.
Existing approaches mainly rely on the textual similarity between source
code and natural language queries and lack a mapping of high-level
semantics between the two. Moreover, the generated code examples may be
redundant and project-specific, which calls for the generation of succinct,
high-coverage code examples.

To address these challenges, in this thesis, we propose three machine
learning based approaches to the generation of code examples. Instead of
matching keywords, our approaches learn the deep semantics of queries and
code snippets.

We first propose a technique, DeepAPI, which generates API usage sequences
via deep learning. DeepAPI adapts a neural language model named RNN
Encoder-Decoder. Given a corpus of annotated API sequences, i.e., pairs of
API sequences and their natural-language annotations, DeepAPI trains a
language model that encodes each sequence of words (the annotation) into a
fixed-length context vector and decodes an API sequence from that context
vector. Then, in response to an API-related user query, it generates API
sequences by consulting the neural language model.
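
For illustration, the following is a minimal PyTorch sketch of an RNN
Encoder-Decoder of this kind; it is not DeepAPI's actual implementation, and
the vocabulary sizes, dimensions, and random training data are toy
assumptions made only for the example.

# Minimal RNN Encoder-Decoder sketch (assumed toy setup, not DeepAPI itself).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, query_tokens):
        # query_tokens: (batch, query_len) word ids of the annotation
        _, h = self.gru(self.embed(query_tokens))
        return h  # (1, batch, hidden_dim) fixed-length context vector

class Decoder(nn.Module):
    def __init__(self, api_vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(api_vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, api_vocab_size)

    def forward(self, api_tokens, context):
        # api_tokens: (batch, api_len) ids of the target API sequence (teacher forcing)
        h, _ = self.gru(self.embed(api_tokens), context)
        return self.out(h)  # logits over the API vocabulary at each step

# Toy usage with random ids; real training would use annotation/API-sequence pairs.
enc, dec = Encoder(vocab_size=1000), Decoder(api_vocab_size=500)
query = torch.randint(0, 1000, (4, 10))   # 4 annotations, 10 words each
apis = torch.randint(0, 500, (4, 12))     # 4 API sequences, 12 calls each
logits = dec(apis[:, :-1], enc(query))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 500), apis[:, 1:].reshape(-1))
loss.backward()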

Furthermore, we propose a technique, DeepCodeHow, which generates code
examples by searching an existing code corpus. To bridge the lexical gap
between queries and source code, DeepCodeHow jointly embeds code snippets
and natural language descriptions into a high-dimensional vector space.
With this unified vector representation, code snippets semantically related
to a natural language query can be retrieved according to the similarity of
their vectors.
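
The sketch below illustrates the joint-embedding idea with deliberately
simple, assumed components rather than DeepCodeHow's actual model: two toy
encoders map code and descriptions into one vector space, a ranking loss
pulls matching pairs together, and retrieval ranks code vectors by cosine
similarity to the query vector.

# Joint embedding sketch for code search (toy encoders, hypothetical dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BagEncoder(nn.Module):
    """Embeds a token sequence and mean-pools it into a single vector."""
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, tokens):  # tokens: (batch, seq_len)
        return torch.tanh(self.proj(self.embed(tokens).mean(dim=1)))

code_enc, desc_enc = BagEncoder(5000), BagEncoder(3000)

# Training signal: a code snippet should be closer to its own description
# than to a mismatched one (margin ranking loss over cosine similarity).
code = torch.randint(0, 5000, (8, 30))    # 8 code snippets, 30 tokens each
desc = torch.randint(0, 3000, (8, 12))    # their descriptions
wrong = desc[torch.randperm(8)]           # shuffled (mostly mismatched) descriptions
c, d, w = code_enc(code), desc_enc(desc), desc_enc(wrong)
loss = F.relu(0.05 - F.cosine_similarity(c, d) + F.cosine_similarity(c, w)).mean()
loss.backward()

# Retrieval: embed the query, rank all code vectors by cosine similarity.
query_vec = desc_enc(torch.randint(0, 3000, (1, 12)))
ranks = F.cosine_similarity(query_vec, c).argsort(descending=True)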

Finally, to generate succinct and high-coverage examples, we design a code
example selection technique named CodeKernel. CodeKernel leverages a
machine learning technique named Graph Kernel. It represents code snippets
as object usage graphs and embeds the graphs into a high-dimensional vector
space. With the graph embedding, CodeKernel clusters similar graphs and
selects a representative graph as the code example.
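
The following sketch is a simplified illustration of this idea, not
CodeKernel's actual kernel or clustering algorithm: the object-usage graphs,
the edge-overlap kernel, and the similarity threshold are all assumptions
made for the example. It compares toy graphs with a simple kernel, clusters
them greedily, and picks the most central graph of each cluster as the
representative example.

# Simplified graph-kernel selection sketch (toy graphs and kernel, assumed data).

# Toy object-usage graphs: each graph is a set of labelled edges
# ("earlier API call" -> "later API call") extracted from a snippet.
graphs = {
    "snippet1": {("File.open", "File.read"), ("File.read", "File.close")},
    "snippet2": {("File.open", "File.read"), ("File.read", "File.close"),
                 ("Logger.log", "File.close")},
    "snippet3": {("Socket.connect", "Socket.send"), ("Socket.send", "Socket.close")},
}

def kernel(g1, g2):
    """A simple edge-intersection kernel: count edges the two graphs share."""
    return len(g1 & g2)

def normalized_kernel(g1, g2):
    return kernel(g1, g2) / (kernel(g1, g1) * kernel(g2, g2)) ** 0.5

# Greedy clustering: a snippet joins a cluster if it is similar to all members.
clusters = []
for name, g in graphs.items():
    for cluster in clusters:
        if all(normalized_kernel(g, graphs[m]) > 0.5 for m in cluster):
            cluster.append(name)
            break
    else:
        clusters.append([name])

# Pick the most "central" snippet of each cluster as its code example.
for cluster in clusters:
    rep = max(cluster, key=lambda n: sum(normalized_kernel(graphs[n], graphs[m])
                                         for m in cluster))
    print(f"cluster {cluster} -> representative example: {rep}")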

We empirically evaluate our techniques on a large-scale code corpus
collected from GitHub. The experimental results show that our proposed
techniques effectively generate relevant code examples and outperform
conventional IR-based approaches.


Date:			Friday, 30 June 2017

Time:			3:00pm - 5:00pm

Venue:			Room 2612A
 			Lifts 31/32

Chairman:		Prof. Huihe Qiu (MAE)

Committee Members:	Prof. Sunghun Kim (Supervisor)
 			Prof. Frederick Lochovsky
 			Prof. Xiaojuan Ma
 			Prof. Yiwen Wang (ECE)
 			Prof. Alice Oh (Comp. Sci., KAIST)


**** ALL are Welcome ****