Predicting Bugs by Analyzing Software History
Jim Whitehead
UC Santa Cruz

Almost all software contains undiscovered bugs, ones that have not yet
been exposed by testing or by users. Wouldn't it be nice if there were
a way to know the location of these bugs? This talk presents two
approaches for predicting the location of bugs. The first, the bug cache,
holds 10% of the files in a software project. Through an analysis of the
software's development history and the location of bugs, files are
added to and removed from the cache based on four notions of bug
locality: temporal, spatial, changed-entity, and new-entity locality.
After processing, files in the bug cache contain 73-95% of
undiscovered bugs. To improve the localization of predicted bugs, the
second prediction approach focuses on configuration management commit
transactions. Using machine learning techniques (Support Vector
Machines), we classify commits as being likely to have a fault, or
unlikely to have a fault. The best precision figures for each project
are typically in the mid-70% range. Hence, it is possible for a
configuration management system to inform a developer, post-commit,
that they have just created a bug (with approximately 75% likelihood).
Sung Kim, a recent PhD graduate, contributed heavily to this work.
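
As a rough illustration of the bug cache idea, the Python sketch below
keeps a fixed-size set of files and updates it as commits and bug
reports arrive. It is only a toy under stated assumptions, not the
algorithm presented in the talk: the class name BugCache, its methods,
the file names, and the least-recently-used eviction rule are all
hypothetical choices made for this example.

```python
from collections import OrderedDict


class BugCache:
    """Toy fixed-size cache of file names suspected to contain bugs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = OrderedDict()  # oldest entry first, most recent last
        self.hits = 0
        self.misses = 0

    def _load(self, path):
        # Insert or refresh a file; evict the least recently used file
        # when the cache is full (eviction policy is an assumption).
        if path in self.files:
            self.files.move_to_end(path)
            return
        if len(self.files) >= self.capacity:
            self.files.popitem(last=False)
        self.files[path] = True

    def on_commit(self, added, changed):
        # New-entity and changed-entity locality: recently added or
        # modified files are treated as likely bug locations.
        for path in list(added) + list(changed):
            self._load(path)

    def on_bug(self, path, co_changed=()):
        # Record whether the cache already contained this buggy file.
        if path in self.files:
            self.hits += 1
        else:
            self.misses += 1
        # Spatial locality: files that changed together with the buggy file.
        for neighbor in co_changed:
            self._load(neighbor)
        # Temporal locality: a file with a recent bug may soon have another.
        self._load(path)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Hypothetical history: one commit followed by three bug reports.
cache = BugCache(capacity=2)
cache.on_commit(added=["parser.c"], changed=["lexer.c"])
cache.on_bug("lexer.c")                       # already cached: hit
cache.on_bug("emit.c", co_changed=["opt.c"])  # not cached: miss
cache.on_bug("emit.c")                        # cached after the miss: hit
```

The hit rate here plays the role of the reported coverage figure: it
is the fraction of bugs that fell in files the cache already held when
the bug was found.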


Jim Whitehead is an Associate Professor of Computer Science at the
University of California, Santa Cruz. Jim's research interests lie in
the areas of software evolution, software design, software
configuration management, and application layer internet protocols.
He has recently developed a new degree program, the BS Computer
Science: Computer Game Design. Jim received his PhD in Information and
Computer Science from UC Irvine in 2000, under his advisor, Richard
N. Taylor.