Leveraging Breakable Dependences for Parallelization

This presentation introduces ALTER, a compiler and deterministic runtime system that exploits breakable dependences to parallelize loops. Using a StaleReads execution model akin to snapshot isolation, dependences are reordered or broken in a principled way, yielding speedup on multi-core systems for algorithms such as Floyd-Warshall and K-Means. The work is from the Rigorous Software Engineering group at Microsoft Research, India.

  • Parallelization
  • Dependences
  • Software Engineering
  • Speedup
  • Multi-core

Presentation Transcript


  1. ALTER: Exploiting Breakable Dependences for Parallelization. Kaushik Rajan, Abhishek Udupa, William Thies. Rigorous Software Engineering, Microsoft Research, India.

  2. Parallelization Reconsidered
     Start from a sequential program and ask: are there dependences between loop iterations? If no, DOALL parallelism applies. If yes?

  3. Parallelization Reconsidered (continued)
     If there are dependences between loop iterations, commutativity analysis gives no speedup and speculation gives no speedup. But dependences are imprecise, dependences can be reordered, and dependences can be broken. Break dependences!
     Our technique: 2.0x speedup on four cores (SG3D, Floyd-Warshall, Agglomerative Clustering, Gauss-Seidel, K-Means).

  4. Parallelization Reconsidered (continued)
     The same picture with the technique named: ALTER breaks dependences to reach the 2.0x speedup on four cores.

  5. Outline
     • Breakable dependences: stale reads
     • Deterministic runtime system
     • Assisted parallelization
     • Results
     (Other details in the paper.)

  6. Breakable Dependences in an Iterative Convergence Algorithm
     Examples: the Floyd-Warshall algorithm, monotonic data-flow analyses, linear algebra solvers, stencil computations.
     Pattern:
       while (!converged) {
         for i = 1 to n { refine(soln[i]) }
       }
     [Figure: iterations I(1)..I(n) under three schemes: sequential DO WHILE, privatized DO WHILE (private copies merged into shared memory), and ALTER's stale reads DO WHILE.]
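
     A minimal sketch of this pattern, concretized as a one-dimensional Gauss-Seidel-style relaxation; the array, constants, and refine() below are illustrative, not the paper's benchmark code. Reading a neighbour that is one iteration stale only slows convergence, it does not change the fixed point, which is why the cross-iteration dependences are breakable.

       #include <math.h>
       #include <stddef.h>

       #define N       1024
       #define EPSILON 1e-6

       static double soln[N];

       /* refine one cell from its neighbours; returns how much the cell moved */
       static double refine(size_t i) {
           double old = soln[i];
           /* under StaleReads these neighbour reads may see slightly stale values */
           soln[i] = 0.5 * (soln[i - 1] + soln[i + 1]);
           return fabs(soln[i] - old);
       }

       void solve(void) {
           double change;
           do {                                      /* convergence loop          */
               change = 0.0;
               for (size_t i = 1; i < N - 1; i++) {  /* candidate parallel loop   */
                   double d = refine(i);
                   if (d > change) change = d;       /* max-reduction on "change" */
               }
           } while (change > EPSILON);
       }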

  7. Stale Reads Execution Model
     [Figure: two workers W1 and W2 interleaving iterations 1-8, with the stale reads marked and non-intersecting write sets.]
     An execution is valid under the StaleReads model iff:
     • the commit order is some serial order of the iterations (it can differ from the sequential order),
     • each iteration reads a stale but consistent snapshot, and
     • staleness is bounded: there are no intersecting writes by intervening iterations.
     This is akin to Snapshot Isolation in databases.
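
     A toy check of these conditions; the iteration record and write-set representation below are assumptions for illustration, not ALTER's data structures. An execution is valid if no iteration that committed between iteration i's snapshot and i's own commit wrote a location that i also wrote.

       #include <stdbool.h>
       #include <stddef.h>

       typedef struct {
           int        snapshot;  /* index of the last commit visible to this iteration */
           int        commit;    /* position of this iteration in the commit order     */
           const int *writes;    /* locations written by this iteration                */
           size_t     nwrites;
       } Iter;

       static bool writes_intersect(const Iter *a, const Iter *b) {
           for (size_t x = 0; x < a->nwrites; x++)
               for (size_t y = 0; y < b->nwrites; y++)
                   if (a->writes[x] == b->writes[y]) return true;
           return false;
       }

       /* staleness is bounded: intervening committed iterations must not write
        * locations that iteration i also writes */
       bool stale_reads_valid(const Iter *iters, size_t n) {
           for (size_t i = 0; i < n; i++)
               for (size_t j = 0; j < n; j++)
                   if (j != i &&
                       iters[j].commit > iters[i].snapshot &&
                       iters[j].commit < iters[i].commit &&
                       writes_intersect(&iters[i], &iters[j]))
                       return false;
           return true;
       }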

  8. Stale Reads with Reduction
     The bounded-staleness condition is relaxed for reduction variables: writes may intersect on a variable var provided that
     • every access to var is an update using an operation O, and
     • the operator O is commutative and associative.
     [Figure: the intersection test is applied to the write sets W1 and W2 after excluding the reduction variable's updates.]
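
     A minimal sketch of the reduction relaxation, assuming a fixed worker count and illustrative names: a variable such as error, which is only ever updated with the commutative and associative operator max, can be privatized per worker and merged at the end, so its writes are allowed to intersect.

       #include <stddef.h>

       #define NWORKERS 4

       static double max_d(double a, double b) { return a > b ? a : b; }

       /* each worker updates only its private copy inside the parallel loop */
       static void update_error(double *private_error, double local_change) {
           *private_error = max_d(*private_error, local_change);
       }

       /* at join time, merge the private copies; any merge order gives the same
        * result because max is commutative and associative */
       double merge_error(const double partial[NWORKERS]) {
           double error = partial[0];
           for (size_t w = 1; w < NWORKERS; w++)
               error = max_d(error, partial[w]);
           return error;
       }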

  9. Deterministic Runtime System
     FORK(): each iteration gets a private copy of the shared state.
     EXECUTE(): body(1), body(2), body(3) run with read/write logging.
     The iterations then try to commit, possibly out of sequential order, and JOIN() produces the merged state.
     StaleReads commit rule: Commit(i) holds iff Writes(j) ∩ Writes(i) = { } for every intervening iteration j.
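
     A single-threaded schematic of this fork/execute/commit/join flow; threading, read logging, and re-execution are elided, and the Task layout, body(), and conflict test are illustrative assumptions rather than the runtime's actual interfaces.

       #include <stdbool.h>
       #include <stddef.h>
       #include <string.h>

       #define NITER       3
       #define STATE_WORDS 64

       typedef struct {
           int    state[STATE_WORDS];      /* private snapshot of the shared state  */
           size_t write_set[STATE_WORDS];  /* locations written, recorded by logging */
           size_t nwrites;
       } Task;

       /* placeholder loop body: updates one word and logs the write */
       static void body(int iter, Task *t) {
           size_t loc = (size_t)iter % STATE_WORDS;
           t->state[loc] += 1;
           t->write_set[t->nwrites++] = loc;
       }

       static bool writes_conflict(const Task *a, const Task *b) {
           for (size_t x = 0; x < a->nwrites; x++)
               for (size_t y = 0; y < b->nwrites; y++)
                   if (a->write_set[x] == b->write_set[y]) return true;
           return false;
       }

       void run_loop(int shared_state[STATE_WORDS]) {
           Task tasks[NITER];

           /* FORK: every iteration starts from a private copy of the state */
           for (int i = 0; i < NITER; i++) {
               memcpy(tasks[i].state, shared_state, sizeof tasks[i].state);
               tasks[i].nwrites = 0;
           }

           /* EXECUTE: run the bodies with write logging (sequential here;
            * the real runtime runs them on worker threads) */
           for (int i = 0; i < NITER; i++)
               body(i, &tasks[i]);

           /* COMMIT: iteration i commits only if its writes do not intersect
            * the writes of iterations that committed since its snapshot */
           for (int i = 0; i < NITER; i++) {
               bool ok = true;
               for (int j = 0; j < i; j++)
                   if (writes_conflict(&tasks[i], &tasks[j])) { ok = false; break; }
               if (ok)
                   for (size_t w = 0; w < tasks[i].nwrites; w++)
                       shared_state[tasks[i].write_set[w]] =
                           tasks[i].state[tasks[i].write_set[w]];
               /* on conflict the runtime would re-execute iteration i (elided) */
           }
           /* JOIN: shared_state now holds the merged result */
       }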

  10. ALTER Annotations
      while (error > EPSILON) {  // convergence loop
        error = 0.0;
        for (uint32_t i = 1; i < grid->xmax - 1; ++i) {
          [StaleReads, (error, max)]
          for (uint32_t j = 1; j < grid->ymax - 1; ++j) {
            for (uint32_t k = 1; k < grid->zmax - 1; ++k) {
              oldValue = grid[i][j][k];
              grid[i][j][k] = a * grid[i][j][k]
                            + b * AddDirectNbr(grid)
                            + c * AddSquareNbr(grid)
                            + d * AddCubeNbr(grid);
              error = max(error, fabs(oldValue - grid[i][j][k]));
            }
          }
        }
      }
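
      For comparison, a hypothetical K-Means-style assignment loop written with the same annotation form, using the (delta, +) reduction listed in the benchmark table later in the deck; the loop body and helpers (closest_centroid, add_point_to_centroid, recompute_centroids) are illustrative assumptions, not the benchmark's actual code.

      while (delta > THRESHOLD) {                    // convergence loop
        delta = 0.0;
        for (uint32_t p = 0; p < npoints; ++p) {
          [StaleReads, (delta, +)]
          uint32_t nearest = closest_centroid(points[p], centroids, k);
          if (nearest != membership[p]) {
            membership[p] = nearest;
            delta += 1.0;                            // "+"-reduction: reassignment count
          }
          add_point_to_centroid(points[p], &new_centroids[nearest]);
        }
        recompute_centroids(new_centroids, centroids, k);
      }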

  11. Test-Driven Parallelism Inference
      Inputs: a sequential program and a test suite.
      The exhaustive parallelization engine tries each annotation and, for each one, runs all test cases and records the outcome of every run as success or failure (crash, timeout, high contention, or output mismatch).
      Output mismatch: assertion failures, or floating point differences (differences below 0.01% are tolerated).
      The engine emits a candidate parallel program, which the user then validates.
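
      A small sketch of the output-comparison step, assuming the outputs are arrays of doubles and that the floating point criterion is a relative difference; function and parameter names are illustrative.

      #include <math.h>
      #include <stdbool.h>
      #include <stddef.h>

      #define FP_TOLERANCE 1e-4   /* 0.01% expressed as a relative fraction */

      bool outputs_match(const double *seq, const double *par, size_t n) {
          for (size_t i = 0; i < n; i++) {
              double denom = fabs(seq[i]) > 0.0 ? fabs(seq[i]) : 1.0;
              if (fabs(seq[i] - par[i]) / denom >= FP_TOLERANCE)
                  return false;   /* record the outcome as an output mismatch */
          }
          return true;
      }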

  12. Assisted Parallelism
      ALTER (assisted parallelism): sequential program + test suite -> exhaustive parallelization engine -> candidate parallel program -> user validation. Auto-tunes for performance; preserves functionality.
      Prior art (automatic parallelism): sequential program -> conservative compiler analysis -> parallel program. Preserves program dependences.

  13. Benchmarks
      BENCHMARK        ALGORITHM TYPE         PARALLELISM                LOOP WGT
      AggloClust       Branch & bound         STALE READS                89%
      GSdense          Dense algebra          STALE READS                100%
      GSsparse         Sparse algebra         STALE READS                100%
      FloydWarshall    Dynamic programming    STALE READS                100%
      SG3D             Structured grids       STALE READS, (error, max)  96%
      BarnesHut        N-body methods         DOALL                      99.6%
      FFT              Spectral methods       DOALL                      100%
      HMM              Graphical models       DOALL                      100%
      Genome           Bioinformatics         STALE READS                89%
      SSCA2            Scientific             STALE READS                76%
      K-means          Data mining            STALE READS, (delta, +)    89%
      Labyrinth        Engineering            _                          99%

  14. Experimental Setup
      Experiments on a 2 x quad-core Xeon machine.
      ALTER transformations implemented in the Microsoft Phoenix compiler framework.
      Comparison with dependence speculation, and with manual parallelization of two applications.

  15. Results: Baseline
      [Bar chart: speedup from 0 to 6 for the staleReads, OutOfOrder, speculate, and DOALL configurations; for two benchmarks there is no scope for dependence speculation.]

  16. Results: ALTER
      [Bar chart: speedup from 0 to 6 for the staleReads, OutOfOrder, speculate, and DOALL configurations.]

  17. Results: Manual Parallelization
      [Bar chart: speedup from 0 to 6 comparing manual parallelization with staleReads, OutOfOrder, speculate, and DOALL. Manual versions get good speedup with fine-grained locking; ALTER's performance is comparable.]

  18. In the Paper
      • ALTER multi-process memory allocator
      • ALTER collections
      • Usage scenarios for ALTER
      • Profiling and instrumentation overhead
      • DOALL parallelism and speculation within ALTER

  19. Related Work
      • Test-driven parallelization: QuickStep applies similar testing methods to non-deterministic programs and offers accuracy bounds [Rinard 2010]
      • Assisted parallelization [Taylor 2011] [Tournavitis 2009]
      • Paralax: annotations improve the precision of analysis, but dependences are respected [Vandierendonck 2010]
      • Implicit parallelization [Burckhardt 2010]
      • Commutative annotations for reordering [August 2007, 2011]
      • Optimistic execution of irregular programs [Pingali 2008]
      As far as we know, the stale reads execution model is new.

  20. Conclusions
      • Breakable dependences must be exploited in order to parallelize certain classes of programs.
      • We propose a new execution model, StaleReads, that violates dependences in a principled way, adopting the database notion of Snapshot Isolation for loop parallelization.
      • ALTER is a compiler and deterministic runtime system that discovers new parallelism in programs.
      • We believe tools for assisted parallelism can help overcome the limits of automatic parallelization.
