Understanding Multicore Performance Efficient Memory System Modeling and Simulation

University dissertation from Uppsala : Acta Universitatis Upsaliensis

Abstract: To increase performance, modern processors employ complex techniques such as out-of-order pipelines and deep cache hierarchies. While the increasing complexity has paid off in performance, it has become harder to accurately predict the effects of hardware/software optimizations in such systems. Traditional microarchitectural simulators typically execute code 10 000×–100 000× slower than native execution, which leads to three problems: First, high simulation overhead makes it hard to use microarchitectural simulators for tasks such as software optimizations where rapid turn-around is required. Second, when multiple cores share the memory system, the resulting performance is sensitive to how memory accesses from the different cores interleave. This requires that applications are simulated multiple times with different interleaving to estimate their performance distribution, which is rarely feasible with today's simulators. Third, the high overhead limits the size of the applications that can be studied. This is usually solved by only simulating a relatively small number of instructions near the start of an application, with the risk of reporting unrepresentative results.In this thesis we demonstrate three strategies to accurately model multicore processors without the overhead of traditional simulation. First, we show how microarchitecture-independent memory access profiles can be used to drive automatic cache optimizations and to qualitatively classify an application's last-level cache behavior. Second, we demonstrate how high-level performance profiles, that can be measured on existing hardware, can be used to model the behavior of a shared cache. Unlike previous models, we predict the effective amount of cache available to each application and the resulting performance distribution due to different interleaving without requiring a processor model. Third, in order to model future systems, we build an efficient sampling simulator. By using native execution to fast-forward between samples, we reach new samples much faster than a single sample can be simulated. This enables us to simulate multiple samples in parallel, resulting in almost linear scalability and a maximum simulation rate close to native execution.

  CLICK HERE TO DOWNLOAD THE WHOLE DISSERTATION. (in PDF format)