Data Centric Cache Measurement Using Hardware and Software Instrumentation
The speed at which microprocessors can perform computations is increasing faster than the speed of access to main memory, making efficient use of memory caches ever more important. Because of this, information about the cache behavior of applications is valuable for performance tuning. To be most useful to a programmer, this information should be presented in a way that relates it to data structures at the source code level; we will refer to this as data centric cache information. This dissertation examines the problem of how to collect such information. We describe techniques for accomplishing this using hardware performance monitors and software instrumentation. We discuss both performance monitoring features that are present in existing processors and a proposed feature for future designs.
The first technique we describe uses sampling of cache miss addresses, relating them to data structures. We present the results of experiments using an implementation of this technique inside a simulator, which show that it can collect the desired information accurately and with low overhead. We then discuss a tool called Cache Scope that implements this on actual hardware, the Intel Itanium 2 processor. Experiments with this tool validate that perturbation and overhead can be kept low in a real-world setting. We present examples of tuning the performance of two applications based on data from this tool. By changing only the layout of data structures, we achieved approximately 24% and 19% reductions in running time.
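As a rough illustration of relating sampled miss addresses to data structures, the following minimal sketch attributes every N-th miss address to the object whose address range contains it. It is not the Cache Scope implementation: the object table `OBJECTS`, the helper names, and the software-side sampling loop are all assumptions made for illustration. On real hardware the sampling would be done by the performance monitoring unit, not by a loop over a full trace.

```python
import bisect
from collections import Counter

# Hypothetical object table: (start_address, size_in_bytes, name).
# A real tool would build this from debug information and allocation tracking.
OBJECTS = [
    (0x1000, 0x400, "grid"),
    (0x2000, 0x100, "weights"),
    (0x3000, 0x800, "particles"),
]

def build_index(objects):
    """Sort objects by start address so lookups can use binary search."""
    ordered = sorted(objects)
    starts = [start for start, _, _ in ordered]
    return starts, ordered

def lookup(index, address):
    """Return the name of the object containing address, or None."""
    starts, ordered = index
    i = bisect.bisect_right(starts, address) - 1
    if i < 0:
        return None
    start, size, name = ordered[i]
    return name if address < start + size else None

def attribute_misses(index, miss_addresses, sample_period):
    """Count every sample_period-th miss address per data structure."""
    counts = Counter()
    for n, address in enumerate(miss_addresses, start=1):
        if n % sample_period == 0:
            counts[lookup(index, address) or "<unknown>"] += 1
    return counts
```

Multiplying each count by the sampling period gives an estimate of the total misses per data structure. A longer period lowers overhead at the cost of a noisier estimate.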
We also describe a technique that uses a proposed hardware feature that provides information about cache evictions to sample eviction addresses. We present results from an implementation of this technique inside a simulator, showing that even though this requires storing considerably more data than sampling cache misses, we are still able to collect information accurate enough to be useful while keeping overhead low. We discuss an example of performance tuning in which we were able to reduce the running time of an application by 8% using information gained from this tool.
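One way to picture why eviction sampling needs more storage than miss sampling is the sketch below. It assumes, purely for illustration, that each sample reports both the evicted address and the address of the incoming line that displaced it; the abstract does not specify the format the proposed feature would use. The table `RANGES` and both function names are hypothetical.

```python
from collections import Counter

# Hypothetical address ranges: name -> (start, end_exclusive).
RANGES = {"grid": (0x1000, 0x1400), "weights": (0x2000, 0x2100)}

def owner(address):
    """Return the name of the data structure containing address."""
    for name, (start, end) in RANGES.items():
        if start <= address < end:
            return name
    return "<unknown>"

def tally_evictions(samples):
    """Count sampled evictions keyed by (evicting structure, evicted structure).

    samples is an iterable of (evicted_address, incoming_address) pairs;
    this pair format is an assumption, not the proposed hardware interface.
    """
    pairs = Counter()
    for evicted, incoming in samples:
        pairs[(owner(incoming), owner(evicted))] += 1
    return pairs
```

Under this assumption the result is indexed by pairs of data structures instead of single ones, so the table grows roughly with the square of the number of structures.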