Understanding, Discovering and Leveraging a Software System's Effective Configuration Space
MetadataShow full item record
Many modern software systems are highly configurable. While a high degree of configurability has many benefits, such as extensibility, reusability and portability, it also has its costs. In the worst case, the full configuration space of a system is the exponentially large combination of all possible option settings and every configuration can potentially produce unique behavior in the software system. Therefore, this software configuration space explosion problem adds combinatorial complexity to many already difficult software engineering tasks. To date, much of the research in this area has tackled this problem using black-box techniques, such as combinatorial interaction testing (CIT). Although these techniques are promising in systematizing the testing and analysis of configurable systems, they ignore a system's internal structure and we think that is a huge missed opportunity. We hypothesize that systems are often structured such that their effective configuration spaces -- the set of configurations needed to achieve a specific goal -- are often much smaller than their full configuration spaces. And if we can efficiently identify or approximate the effective configuration spaces, then we can use that information to greatly improve various software engineering tasks. To understand the effective configuration spaces of software systems, we used symbolic evaluation, a white-box analysis, to capture all executions a system can take under any configuration. The symbolic evaluation results confirmed that the effective configuration spaces are in fact the composition of many small, self-contained groupings of options. And we developed analysis techniques to succinctly characterize how configurations interact with a system's internal structures. We showed that while the majority of a system's interactions are relatively low strength, some important high-strength interactions do exist, and that existing approaches such as CIT are highly unlikely to generate them in practice. Results from our in-depth investigations serve as the foundation for developing new approaches to efficiently discovering effective configuration spaces. We proposed a new algorithm called interaction tree discovery (iTree) that aims to identify sets of configurations that are smaller than those generated by CIT, while also including important high-strength interactions missed by practical applications of CIT. On each iteration of iTree, we first use low-strength covering array to test the system under, and then apply machine learning techniques to discover new interactions that are potentially responsible for any new coverage seen. By repeating this process, iTree builds up a set of configurations likely to contain key high-strength interactions. We evaluated iTree and our results strongly suggest that iTree can identify high-coverage sets of configurations more effectively than traditional CIT or random sampling. We next developed the interaction learning approach that estimates the configuration interactions underlying the effective configuration space by building classification models for iTree execution results. This approach is light-weight, yet produces accurate estimates of the interactions; making leveraging effective configuration spaces practical for many software engineering tasks. Using this approach, we were able to approximate the effective configuration space of the ~1M-LOC MySQL, something that is infeasible using existing techniques, at very low cost.