Profiling Methods for Memory Centric Software Performance Analysis
Abstract: To reduce latency and increase bandwidth to memory, modern microprocessors are often designed with deep memory hierarchies including several levels of caches. For such microprocessors, both the latency and the bandwidth to off-chip memory are typically about two orders of magnitude worse than the latency and bandwidth to the fastest on-chip cache. Consequently, the performance of many applications is largely determined by how well they utilize the caches and bandwidths in the memory hierarchy. For such applications, there are two principal approaches to improve performance: optimize the memory hierarchy and optimize the software. In both cases, it is important to understand, both qualitatively and quantitatively, how the software utilizes and interacts with the resources (e.g., cache and bandwidths) in the memory hierarchy.
This thesis presents several novel profiling methods for memory-centric software performance analysis. The goal of these profiling methods is to provide general, high-level, quantitative information describing how the profiled applications utilize the resources in the memory hierarchy, and thereby help software and hardware developers identify opportunities for memory-related hardware and software optimizations. For such techniques to be broadly applicable, the data collection should have minimal impact on the profiled application and should not depend on custom hardware and/or operating system extensions. Furthermore, the resulting profiling information should be accurate and easy to interpret.
While several use cases are presented, the main focus of this thesis is the design and evaluation of the core profiling methods. These core profiling methods measure and/or estimate how high-level performance metrics, such as miss and fetch ratios, off-chip bandwidth demand, and execution rate, are affected by the amount of resources the profiled applications receive. This thesis shows that such high-level profiling information can be accurately obtained with very little impact on the profiled applications and without requiring costly simulations or custom hardware support.
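To make concrete what it means for a metric such as miss ratio to depend on the amount of cache an application receives, the following is a minimal sketch. It is not one of the profiling methods presented in the thesis: it assumes a fully associative LRU cache, a toy trace of cache-line addresses, and an exact (high-overhead) stack-distance computation. The function names `stack_distances`, `miss_ratio`, and `miss_ratio_curve` are hypothetical and chosen for illustration only.

```python
from collections import OrderedDict


def stack_distances(trace):
    """Return the LRU stack distance of every access in `trace`.

    The stack distance is the number of distinct cache lines touched
    since the previous access to the same line; first-time (cold)
    accesses get None.
    """
    stack = OrderedDict()  # keys ordered from least to most recently used
    distances = []
    for line in trace:
        if line in stack:
            keys = list(stack.keys())
            # Lines that sit above `line` in the LRU stack.
            distances.append(len(keys) - 1 - keys.index(line))
            stack.move_to_end(line)
        else:
            distances.append(None)
            stack[line] = True
    return distances


def miss_ratio(trace, cache_lines):
    """Miss ratio of a fully associative LRU cache with `cache_lines` lines.

    Assumption: an access hits if and only if its stack distance is
    smaller than the cache size; cold accesses always miss.
    """
    dists = stack_distances(trace)
    misses = sum(1 for d in dists if d is None or d >= cache_lines)
    return misses / len(dists)


def miss_ratio_curve(trace, sizes):
    """Map each cache size (in lines) to the resulting miss ratio."""
    return {size: miss_ratio(trace, size) for size in sizes}


if __name__ == "__main__":
    # Toy trace: a loop over three cache lines, executed twice.
    toy_trace = [0, 1, 2, 0, 1, 2]
    for size, ratio in miss_ratio_curve(toy_trace, [1, 2, 3, 4]).items():
        print(f"{size} lines: miss ratio {ratio:.2f}")
```

For the toy trace, the working set of three lines does not fit in a cache of one or two lines, so every access misses; from three lines upward only the cold misses remain. An exact computation like this requires the full address trace, which is the kind of cost the thesis aims to avoid.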