Predicting Bugs by Analyzing Software History

Jim Whitehead
UC Santa Cruz

Almost all software contains undiscovered bugs, ones that have not yet been exposed by testing or by users. Wouldn't it be nice if there were a way to know where these bugs are located? This talk presents two approaches for predicting the location of bugs.

The first approach, the bug cache, holds 10% of the files in a software project. Through an analysis of the software's development history and the location of past bugs, files are added to and removed from the cache based on four notions of bug locality: temporal, spatial, changed-entity, and new-entity locality. After processing, the files in the bug cache contain 73-95% of undiscovered bugs.

To improve the localization of predicted bugs, the second prediction approach focuses on configuration management commit transactions. Using a machine learning technique (Support Vector Machines), we classify each commit as either likely or unlikely to contain a fault. The best precision figures for each project are typically in the mid-70% range. Hence, it is possible for a configuration management system to inform a developer, post-commit, that they have just created a bug (with approximately 75% likelihood).

Sung Kim, a recent PhD graduate, contributed heavily to this work.

Jim Whitehead is an Associate Professor of Computer Science at the University of California, Santa Cruz. Jim's research interests lie in the areas of software evolution, software design, software configuration management, and application layer internet protocols. He has recently developed a new degree program, the BS Computer Science: Computer Game Design. Jim received his PhD in Information and Computer Science from UC Irvine in 2000, under his advisor Richard N. Taylor.
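
As an illustration of the bug cache idea described in the abstract, the following is a minimal sketch of how a fixed-size cache might be updated while replaying a project's history. It is not the speaker's implementation. The Commit record, its field names, the cochanged map (standing in for spatial locality), and the least-recently-loaded eviction rule are all assumptions made for this example; the abstract does not say how files are chosen for removal.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class Commit:
    """Hypothetical history record; the field names are illustrative only."""
    changed: list = field(default_factory=list)  # files modified by the commit
    added: list = field(default_factory=list)    # files created by the commit
    fixed: list = field(default_factory=list)    # files in which a bug was fixed


class BugCache:
    """Fixed-size set of files considered most likely to contain bugs."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.files = OrderedDict()  # ordered from least to most recently loaded

    def _load(self, path):
        # Insert or refresh a file, then evict the least recently loaded
        # entries (an assumed replacement policy) until the cache fits.
        self.files[path] = True
        self.files.move_to_end(path)
        while len(self.files) > self.capacity:
            self.files.popitem(last=False)

    def process(self, commit, cochanged=None):
        """Replay one commit and return (hits, misses) for its bug fixes."""
        cochanged = cochanged or {}
        hits = misses = 0
        for path in commit.fixed:
            # A hit means the buggy file was already in the cache.
            if path in self.files:
                hits += 1
            else:
                misses += 1
            # Spatial locality: files that tend to change with the buggy file.
            for neighbor in cochanged.get(path, []):
                self._load(neighbor)
            # Temporal locality: a file with a recent bug may have more.
            self._load(path)
        # Changed-entity and new-entity locality.
        for path in commit.changed + commit.added:
            self._load(path)
        return hits, misses
```

To mirror the 10% figure from the abstract, the cache could be created with a capacity of one tenth of the project's file count, for example BugCache(capacity=max(1, total_files // 10)). Summing the hits and misses over the whole history gives a hit rate, which in this sketch plays the role of the share of bugs that fall inside the cache.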