Improving OpenMP Productivity with Data Locality Optimizations and High-resolution Performance Analysis

University dissertation from Stockholm : KTH Royal Institute of Technology

Abstract: The combination of high-performance parallel programming and multi-core processors is the dominant approach to meet the ever increasing demand for computing performance today. The thesis is centered around OpenMP, a popular parallel programming API standard that enables programmers to quickly get started with writing parallel programs. However, in contrast to the quickness of getting started, writing high-performance OpenMP programs requires high effort and saps productivity.Part of the reason for impeded productivity is OpenMP’s lack of abstractions and guidance to exploit the strong architectural locality exhibited in NUMA systems and manycore processors. The thesis contributes with data distribution abstractions that enable programmers to distribute data portably in NUMA systems and manycore processors without being aware of low-level system topology details. Data distribution abstractions are supported by the runtime system and leveraged by the second contribution of the thesis – an architecture-specific locality-aware scheduling policy that reduces data access latencies incurred by tasks, allowing programmers to obtain with minimal effort upto 69% improved performance for scientific programs compared to state-of-the-art work-stealing scheduling.Another reason for reduced programmer productivity is the poor support extended by OpenMP performance analysis tools to visualize, understand, and resolve problems at the level of grains– task and parallel for-loop chunk instances. The thesis contributes with a cost-effective and automatic method to extensively profile and visualize grains. Grain properties and hardware performance are profiled at event notifications from the runtime system with less than 2.5% overheads and visualized using a new method called theGrain Graph. The grain graph shows the program structure that unfolded during execution and highlights problems such as low parallelism, work inflation, and poor parallelization benefit directly at the grain level with precise links to problem areas in source code. The thesis demonstrates that grain graphs can quickly reveal performance problems that are difficult to detect and characterize in fine detail using existing tools in standard programs from SPEC OMP 2012, Parsec 3.0 and Barcelona OpenMP Tasks Suite (BOTS). Grain profiles are also applied to study the input sensitivity and similarity of BOTS programs.All thesis contributions are assembled together to create an iterative performance analysis and optimization work-flow that enables programmers to achieve desired performance systematically and more quickly than what is possible using existing tools. This reduces pressure on experts and removes the need for tedious trial-and-error tuning, simplifying OpenMP performance analysis.

  This dissertation MIGHT be available in PDF-format. Check this page to see if it is available for download.