From Sung Kim
My core research area is Software Engineering, focusing on software evolution, program analysis, and
empirical studies. My chief research interest is programmer productivity, in particular, identifying faults
in program development or in deployed programs by mining software repositories, source code (static
analysis), and program execution (dynamic analysis).
A common theme in my research is analysis of software faults. The consequences of faulty software
can include enormous loss of money or even lives. Providing a less fault-prone software development
environment and effective algorithms to find faults is a thrilling and challenging research topic and
goal. Towards this goal, I have been developing techniques that learn from successes and failures in
software evolution, static source code, and program execution.
For example, BugCache learns from cached history and predicts future faults. ChangeClassification
learns from previous bugs and indicates whether a new change includes a bug. ReCrash learns
the status of method calls from program crashes and reproduces the crashes. Currently, I am
participating in an ongoing project called Application Communities, which learns invariants from normal
and abnormal program executions, and dynamically prevents abnormal execution.
Mining Software Repositories
Identifying the most fault-prone software modules and changes provides several benefits. It permits
available resources to be focused on the modules or changes that have the most faults. Additionally,
a list of fault-prone locations makes it possible to apply time-intensive techniques, such as
manual inspection, formal methods, and various kinds of static code analysis, selectively. I have
developed two successful fault prediction algorithms, BugCache and ChangeClassification.
- BugCache - "If a file had a fault recently, it will tend to have other faults soon."
- BugCache maintains a cache of locations that are likely to have faults. When a fault occurs, BugCache caches the location itself as well as any locations changed together with the fault, recently added locations, and recently changed locations. In an evaluation on seven open source projects with more than 200,000 revisions, a cache containing 10% of the source code files accounted for 73%-95% of faults, a significant advance beyond the state of the art.
- ChangeClassification - "I can tell whether your change includes bugs as soon as you submit it."
- ChangeClassification uses a machine learning classifier to determine whether a new software change is more similar to prior buggy changes, or to clean changes. In this manner, ChangeClassification predicts when bugs are introduced by software changes. The classifier is trained using features (in the machine learning sense) extracted from the revision history of a software project. Our trained classifier can classify buggy changes with 78% accuracy and identify 65% of all bugs on average.
- ReCrash - "Making crashes reproducible."
- It is difficult to fix a problem without being able to reproduce it. However, reproducing a problem is often difficult and time-consuming. I implemented ReCrash, which generates multiple unit tests that reproduce a given program crash. ReCrash dynamically monitors method calls during every execution of the target program. If the program crashes, ReCrash stores the information learned about the relevant method calls and uses the saved information to create unit tests reproducing the crash. ReCrash reproduces all real crashes from javac, SVNKit, Eclipse JDT, and BST. ReCrash is efficient, incurring only 13%-64% performance overhead. If this overhead is unacceptable, then ReCrash has another mode that has negligible overhead until a crash occurs, and 0%-1.7% overhead until a second crash occurs, at which point the test cases are generated.
- Application Communities - "the zero-day patch"
- I have participated in the Application Communities project and developed an infrastructure that monitors program execution and prevents program failure. The infrastructure learns invariants that distinguish normal program executions from abnormal executions. The infrastructure generates dynamic patches for the abnormal program executions (faults). Created patches are automatically tested and successful patches are automatically deployed – a zero-day patch.
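The BugCache loading policy described above can be illustrated with a minimal sketch. It is not the BugCache implementation: the class name, method names, and the least-recently-used eviction rule are assumptions made for illustration, since the text above does not specify how entries are replaced when the cache is full.

```python
from collections import OrderedDict


class BugCache:
    """Illustrative fixed-size cache of fault-prone file paths.

    Hypothetical sketch: the names and the LRU eviction rule are assumptions.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()  # path -> None, least recently used first
        self.hits = 0
        self.misses = 0

    def _load(self, path):
        # Refresh an existing entry, or insert a new one after evicting
        # the least recently used entry when the cache is full.
        if path in self._entries:
            self._entries.move_to_end(path)
            return
        if len(self._entries) >= self.capacity:
            self._entries.popitem(last=False)
        self._entries[path] = None

    def on_change(self, path):
        # Recently added and recently changed locations are loaded eagerly.
        self._load(path)

    def on_fault(self, path, co_changed=()):
        # A hit means the faulty location was already in the cache.
        if path in self._entries:
            self.hits += 1
        else:
            self.misses += 1
        # Load the faulty location and the locations changed together with it.
        self._load(path)
        for other in co_changed:
            self._load(other)

    def __contains__(self, path):
        return path in self._entries

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Replaying a project's revision history through `on_change` and `on_fault` and then reading `hit_rate()` gives the fraction of faults that fell on cached locations, which is the kind of figure reported in the evaluation above.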
Current research projects
- Mining semantic fault patterns
- By mining software history, BugCache and ChangeClassification successfully predict fault locations. However, these approaches do not explain why the predicted locations are faulty. Mining semantic patterns and reasons for faults helps developers understand the faults and prevent them in advance. Preliminary research on mining important factors to find faults reveals that, surprisingly, source code that includes 'if', '<=', or '0' is more fault-prone. Mining semantic fault patterns is a topic for future research.
- Combining multiple approaches
- Combining analysis and mining approaches opens a new research area in which the two complement each other to identify important faults. For example, I mine software history to prioritize the output of static analysis. The result is promising in that mining software history significantly improves the static analysis results. My experience in both dynamic and static analysis will lead to more research on combining mining software repositories, static analysis, and dynamic analysis to improve software developer productivity.
- Making production-level tools with industrial support
- Proposed algorithms such as BugCache, ChangeClassification, and ReCrash are useful in real software development and deployment processes. Developing production-level tools from these algorithms benefits developers. The tools can prevent developers from introducing faults and help with debugging. Some companies, including Coverity, Apple, and Yahoo, are interested in such tools, and I will work closely with industry.