Distilling Large Language Models for Software Engineering Tasks with Bootstrap Instructing

The Hong Kong University of Science and Technology
Department of Computer Science and Engineering

Final Year Thesis Oral Defense

Title: "Distilling Large Language Models for Software Engineering Tasks with 
Bootstrap Instructing"

by

LI Yijia

Abstract:

Although pre-trained large language models (LLMs) have demonstrated remarkable 
ability across a wide range of software engineering problems, they remain 
insufficient for domain-specific code due to a lack of training data. 
Prevalent methods of fine-tuning LLMs rely heavily on the manual creation of 
instruction data, which is time-consuming and labor-intensive. In this 
project, we propose a methodology that combines machine learning techniques 
such as knowledge distillation, self-instruct, and data augmentation to 
bootstrap the generation of task-specific training datasets, which are then 
used to improve the performance of local LLMs on downstream software 
engineering applications. Our pipeline generates instruction, input, and 
output samples from a limited seed set using GPT-4, then filters out invalid 
or overly similar pairs before using the remainder to fine-tune the original 
model. Applying our method to Magicoder-S-DS-6.7B yields a significant 
improvement in accuracy on the binary classification task of API misuse 
detection, outperforming state-of-the-art LLMs with larger parameter counts. 
This project provides a fast and effective way to align pre-trained language 
models with downstream software engineering tasks, facilitating the 
application of LLMs in software engineering research.
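The de-duplication step of such a pipeline, discarding generated pairs that are too similar to instructions already collected, can be sketched as follows. This is a minimal illustration only: the similarity metric (`difflib` sequence matching), the threshold, and the sample instructions are assumptions for demonstration, not the thesis's actual implementation (Self-Instruct-style pipelines typically use ROUGE-L).

```python
import difflib

def is_novel(candidate, pool, threshold=0.7):
    """Return True if the candidate instruction is sufficiently
    dissimilar from every instruction already in the pool."""
    for existing in pool:
        ratio = difflib.SequenceMatcher(None, candidate, existing).ratio()
        if ratio >= threshold:
            return False
    return True

def filter_candidates(seeds, candidates, threshold=0.7):
    """Bootstrap-style filtering: start from the seed set, and keep
    only generated candidates that pass the novelty check, growing
    the pool as accepted candidates are added."""
    pool = list(seeds)
    kept = []
    for cand in candidates:
        if is_novel(cand, pool, threshold):
            pool.append(cand)
            kept.append(cand)
    return kept

# Illustrative data: one seed instruction and two generated candidates,
# the first of which is a near-duplicate of the seed.
seeds = ["Detect misuse of the file-close API in this C snippet."]
candidates = [
    "Detect misuse of the file-close API in this C snippet.",
    "Classify whether this Java code misuses the Iterator API.",
]
print(filter_candidates(seeds, candidates))
```

In practice each retained pair would also pass a validity check (e.g. parseable output, non-empty fields) before entering the fine-tuning dataset.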


Date            : 3 May 2024 (Friday)

Time            : 14:00 - 14:40

Venue           : Room 5501 (near lifts 25/26), HKUST

Advisor         : Prof. CHEUNG Shing-Chi

2nd Reader      : Dr. XU Dan