Asymmetric Performance in Task Runtimes


This study explores the impact of asymmetric performance on asynchronous task-based runtimes in HPC environments. It discusses the challenges, potential solutions, and an experimental evaluation using two runtimes, Charm++ and HPX-5, with a focus on performance variability and adaptability in task-based systems.

  • Asymmetric Performance
  • Task Runtimes
  • HPC Environments
  • Experimental Evaluation
  • System Community




Presentation Transcript


  1. The Effect of Asymmetric Performance on Asynchronous Task-Based Runtimes. Debashis Ganguly and John R. Lange. ROSS 2017.

  2. Changing Face of HPC Environments. Traditional: dedicated resources, with the simulation on the supercomputer and visualization handled by separate processing and storage clusters. Future: collocated workloads, with simulation and visualization sharing the supercomputer. Task-based runtimes are a potential solution. Goal: can asynchronous task-based runtimes handle asymmetric performance?

  3. Task-Based Runtimes. Experiencing a renewal of interest in the systems community, and assumed to better address performance variability: they adopt an (over-)decomposed task model, which allows fine-grained scheduling decisions and lets them adapt to asymmetric or variable performance (see the sketch below). But they were originally designed for application-induced load imbalances, e.g., an adaptive mesh refinement (AMR) based application; performance asymmetry can be of finer granularity, e.g., variable CPU time in time-shared environments.
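
  To make the over-decomposition idea concrete, here is a minimal generic
  sketch in plain C++ (not Charm++ or HPX-5 code): with many more tasks than
  cores, a slower core simply claims fewer tasks from the shared queue, and
  the imbalance is absorbed without explicit rebalancing.

      // Over-decomposition sketch: far more tasks than cores, so a slowed
      // core naturally claims fewer tasks. Illustrative only; real runtimes
      // use per-worker queues plus work stealing or migration.
      #include <atomic>
      #include <thread>
      #include <vector>

      void run_task(int id) { /* one small, independent unit of work */ }

      int main() {
          const int n_tasks = 4096;      // many more tasks than cores
          std::atomic<int> next{0};      // shared task counter
          unsigned n_cores = std::thread::hardware_concurrency();
          std::vector<std::thread> workers;
          for (unsigned c = 0; c < n_cores; ++c)
              workers.emplace_back([&] {
                  for (int t; (t = next.fetch_add(1)) < n_tasks; )
                      run_task(t);       // a slow core just grabs fewer tasks
              });
          for (auto &w : workers) w.join();
      }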

  4. Basic Experimental Evaluation. A synthetic setup that emulates performance asymmetry in a time-shared configuration, in a static and predictable setting: the benchmark runs on 12 cores and shares one core with a background workload, while we vary the percentage of CPU time granted to that competing workload. Environment: a 12-core, dual-socket compute node with hyperthreading disabled. We used cpulimit to control the competing workload's percentage of CPU time.
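
  For reference, cpulimit throttles a process by periodically sending it
  SIGSTOP/SIGCONT. An invocation along these lines (the workload binary name
  is hypothetical, and pinning via taskset is an assumption not described in
  the slides) would hold the background workload at 50% of the 12th core:

      taskset -c 11 cpulimit -l 50 ./prime_generator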

  5. Workload Configuration. Diagram of the two settings on the benchmark node: an 11-core setting (benchmark threads on 11 cores, with the competing workload and otherwise idle time on the 12th) and a 12-core setting (benchmark threads on all 12 cores, time-sharing the 12th with the competing workload).

  6. Experimental Setup. We evaluated two runtimes: Charm++ (LeanMD) and HPX-5 (LULESH, HPCG, LibPXGL). Competing workloads: a prime number generator (entirely CPU-bound, with a minimal memory footprint) and a kernel compilation (stresses internal OS features such as the I/O and memory subsystems). A sketch of the first workload follows.
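
  The slides do not include the workload sources; a minimal sketch of a
  CPU-bound, memory-light prime number generator of the kind described might
  look like this:

      // Trial-division prime counter: purely CPU-bound, O(1) extra memory,
      // no I/O beyond an occasional progress line.
      #include <cstdint>
      #include <cstdio>

      static bool is_prime(uint64_t n) {
          if (n < 2) return false;
          for (uint64_t d = 2; d * d <= n; ++d)
              if (n % d == 0) return false;
          return true;
      }

      int main() {
          uint64_t count = 0;
          for (uint64_t n = 2; ; ++n)          // runs until killed
              if (is_prime(n) && (++count & 0xFFFFF) == 0)
                  std::printf("%llu primes so far\n",
                              (unsigned long long)count);
      }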

  7. Charm++. Iterative, over-decomposed applications with an object-based programming model: tasks are implemented as C++ objects, and objects can migrate across intra- and inter-node boundaries.
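
  A rough sketch of such a migratable task object, following standard Charm++
  conventions (the .ci interface file, class names, and generated headers
  below are hypothetical, paraphrased from Charm++ documentation rather than
  taken from the slides):

      // The entry points would be declared in a .ci interface file, e.g.:
      //   array [1D] Patch { entry Patch(); entry void step(int iter); };
      // from which charmc generates patch.decl.h / patch.def.h.
      #include "patch.decl.h"
      #include "pup_stl.h"   // STL pup support (assumed available)
      #include <vector>

      class Patch : public CBase_Patch {
          std::vector<double> particles;   // task state moves with the object
      public:
          Patch() : particles(1024, 0.0) {}
          Patch(CkMigrateMessage *m) {}    // constructor used after migration
          void step(int iter) { /* one over-decomposed unit of work */ }
          // PUP ("pack/unpack") serializes the object so the runtime's load
          // balancer can migrate it across core or node boundaries.
          void pup(PUP::er &p) { p | particles; }
      };
      #include "patch.def.h"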

  8. Charm++. A separate, centralized load balancer component preempts application progress and actively migrates objects based on the current state, causing computation to block across the other cores.

  9. Choice of Load Balancer Matters. Chart: runtime (s) and percentage performance degradation versus the percentage of CPU utilized by the background prime number generator on the 12th core, comparing GreedyLB, RotateLB, RandCentLB, RefineLB, RefineSwapLB, and no load balancer; the strategies diverge by up to 198%. We selected RefineSwapLB for the rest of the experiments.
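
  In standard Charm++ builds the strategy is chosen at launch time with the
  +balancer runtime flag (assumed invocation; the binary name is
  hypothetical):

      ./charmrun +p12 ./leanmd +balancer RefineSwapLB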

  10. Invocation Frequency Matters. MetaLB invokes the load balancer less frequently, based on heuristics. Chart: total runtime and load-balancing overhead of RefineSwapLB, with and without MetaLB, versus the percentage of CPU utilized by the background prime number generator on the 12th core. We enabled MetaLB for our experiments.
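
  MetaLB is likewise enabled at launch; in standard Charm++ builds this is,
  to our knowledge, the +MetaLB runtime flag (assumed invocation; the binary
  name is hypothetical):

      ./charmrun +p12 ./leanmd +balancer RefineSwapLB +MetaLB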

  11. Charm++: LeanMD. Chart: runtime (s) and percentage performance degradation versus the percentage of CPU utilized by the background prime number generator on the 12th core, for 12 threads on 12 cores (measured and theoretical expectation) and 11 threads on 11 cores; 53% divergence. 12 cores are worse than 11 cores unless the application gets at least 75% of the shared core's capacity (i.e., at most 25% goes to the background workload). If the application cannot get more than 75% of the core's capacity, it is better off ignoring the core completely.

  12. Charm++: LeanMD. With kernel compilation as the background workload: more variable, but consistent mean performance. Chart: runtime (s) and percentage performance degradation versus the percentage of CPU utilized by the background kernel compilation on the 12th core, for 12 threads on 12 cores and 11 threads on 11 cores.

  13. HPX-5. Parcel: contains a computational task and a reference to the data the task operates on. Follows the work-first principle of Cilk-5: every scheduling entity processes parcels from the top of its own scheduling queue.
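
  Schematically, a parcel pairs a task with a reference to its operand data
  (an illustrative C++ struct, not HPX-5's actual representation):

      struct Parcel {
          void (*action)(void *);  // the computational task to execute
          void *target;            // reference to the data it operates on
      };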

  14. HPX-5. Scheduling is implemented using random work stealing: there is no centralized decision-making process, and the overhead of work stealing is assumed by the stealer (a sketch follows).
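
  A minimal generic sketch of random work stealing in plain C++ (not HPX-5's
  actual scheduler): a worker drains its own queue first and only then picks
  a random victim, so the rebalancing cost falls on the thread with nothing
  better to do.

      #include <deque>
      #include <mutex>
      #include <optional>
      #include <random>
      #include <vector>

      struct Worker {
          std::deque<int> q;   // parcels, reduced here to task ids
          std::mutex m;
      };

      std::optional<int> pop_or_steal(std::vector<Worker> &ws, size_t self,
                                      std::mt19937 &rng) {
          {   // fast path: work-first, take from our own queue
              std::lock_guard<std::mutex> g(ws[self].m);
              if (!ws[self].q.empty()) {
                  int t = ws[self].q.front();
                  ws[self].q.pop_front();
                  return t;
              }
          }
          // slow path: the stealer, not the busy worker, pays the overhead
          size_t victim = rng() % ws.size();
          if (victim == self) return std::nullopt;   // retry on next call
          std::lock_guard<std::mutex> g(ws[victim].m);
          if (ws[victim].q.empty()) return std::nullopt;
          int t = ws[victim].q.back();   // steal from the opposite end
          ws[victim].q.pop_back();
          return t;
      }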

  15. OpenMP: LULESH. Overall application performance is determined by the slowest rank, and the reliance on collective-based communication makes it vulnerable to asymmetries in performance. Chart: runtime (s) and percentage performance degradation versus the percentage of CPU utilized by the background prime number generator on the 12th core, for 12 threads on 12 cores (measured and theoretical expectation) and 11 threads on 11 cores; 185% divergence.
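
  The sensitivity comes from the implicit barrier at the end of each parallel
  region; a schematic example of standard OpenMP semantics (not LULESH
  source):

      #include <omp.h>

      void timestep(double *u, int n) {
          // An implicit barrier ends the loop: every thread waits for the
          // slowest one, so a core at half speed gates the whole iteration.
          #pragma omp parallel for
          for (int i = 0; i < n; ++i)
              u[i] = 0.5 * u[i];   // stand-in for the real stencil work
      }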

  16. HPX-5: LULESH. A traditional BSP application implemented using task-based programming. Chart: runtime (s) and percentage performance degradation versus the percentage of CPU utilized by the background prime number generator on the 12th core, for 12 threads on 12 cores (measured and theoretical expectation) and 11 threads on 11 cores; 42% divergence. No cross-over point: 12 cores are consistently worse than 11 cores.

  17. HPX-5: HPCG. Another BSP application implemented in the task-based model. Chart: runtime (s) and percentage performance degradation versus the percentage of CPU utilized by the background prime number generator on the 12th core, for 12 threads on 12 cores (measured and theoretical expectation) and 11 threads on 11 cores; annotations: 10% to the background workload, 5% divergence. Better than the theoretical expectation; 12 cores are consistently worse than 11 cores.

  18. HPX-5: LibPXGL. An asynchronous graph processing library, and a more natural fit for the task-based model. Chart: runtime (s) and percentage performance degradation versus the percentage of CPU utilized by the background prime number generator on the 12th core, for 12 threads on 12 cores (measured and theoretical expectation) and 11 threads on 11 cores; annotations: 22% to the background workload, 5% divergence. No cross-over point: 12 cores are consistently worse than 11 cores.

  19. HPX-5: Kernel Compilation. A more immediate, rather than gradual, decline. Charts: runtime (s) and percentage performance degradation versus the percentage of CPU speed consumed by the background kernel-compilation workload on the 12th core, for LULESH, HPCG, and LibPXGL (12 threads on 12 cores versus 11 threads on 11 cores).

  20. Conclusion. Performance asymmetry is still challenging. This preliminary evaluation used tightly controlled, time-shared CPUs in a static and consistent configuration. Task-based runtimes fare better than BSP, but on average a CPU loses its utility to a task-based runtime as soon as its performance diverges by only 25%.

  21. Thank You. Debashis Ganguly, Ph.D. Student, Computer Science Department, University of Pittsburgh. debashis@cs.pitt.edu, https://people.cs.pitt.edu/~debashis/. The Prognostic Lab: http://www.prognosticlab.org
