Introduction to Loop Transformations in Compiler Optimizations


Delve into the world of compiler optimizations with a focus on loop transformations. Explore the significance of high-level optimizations, parallelism, and data locality. Understand the impact of loop transformations on program performance and learn why they are crucial for achieving faster execution times. Discover the challenges in automating optimizations and the intricacies of applying transformations effectively.

  • Compiler optimizations
  • Loop transformations
  • Parallelism
  • Data locality
  • Automation


Presentation Transcript


  1. Research in Compilers and Introduction to Loop Transformations. Part II: Compiler Optimizations. Tomofumi Yuki, EJCP 2017, June 29, Toulouse.

  2. Compiler Optimizations. Low-level optimizations: register allocation, instruction scheduling, constant propagation, ... High-level optimizations (the focus for today): loop transformations, coarse-grained parallelism, ...

  3. High-Level Optimizations. Goals: parallelism and data locality. Why parallelism? Why data locality? Why high-level?

  4. Why Loop Transformations? The 90/10 rule: 90% of the execution time is spent in less than 10% of the source code. Loop nests are the hotspot of almost all programs: a few lines of change can have a huge impact, and they are a natural source of parallelism.

  5. Why Loop Transformations? Which is faster?

     for (i=0; i<N; i++)
       for (j=0; j<N; j++)
         for (k=0; k<N; k++)
           C[i][j] += A[i][k] * B[k][j];

     for (i=0; i<N; i++)
       for (k=0; k<N; k++)
         for (j=0; j<N; j++)
           C[i][j] += A[i][k] * B[k][j];

  6. Why is it Faster? Hardware prefetching. In the i-j-k order the innermost loop walks k: C[i][j] is unchanged, A[i][k] moves to the next column (contiguous), but B[k][j] moves to the next row (stride-N access). In the i-k-j order the innermost loop walks j: A[i][k] is unchanged, while C[i][j] and B[k][j] both move to the next column, so every moving access is contiguous and prefetch-friendly.

     for (i=0; i<N; i++)
       for (j=0; j<N; j++)
         for (k=0; k<N; k++)
           C[i][j] += A[i][k] * B[k][j];  // C: unchanged, A: next col, B: next row

     for (i=0; i<N; i++)
       for (k=0; k<N; k++)
         for (j=0; j<N; j++)
           C[i][j] += A[i][k] * B[k][j];  // C: next col, A: unchanged, B: next col
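
     A quick way to see the gap is to time both orderings. A minimal benchmark sketch in C (the size N, the initialization values, and the use of clock() are illustrative choices, not from the slides):

     #include <stdio.h>
     #include <time.h>

     #define N 1024

     static double A[N][N], B[N][N], C[N][N];

     /* Time one multiplication with the given loop order. */
     static double run(void (*mm)(void)) {
         clock_t start = clock();
         mm();
         return (double)(clock() - start) / CLOCKS_PER_SEC;
     }

     static void mm_ijk(void) {
         for (int i = 0; i < N; i++)
             for (int j = 0; j < N; j++)
                 for (int k = 0; k < N; k++)
                     C[i][j] += A[i][k] * B[k][j];
     }

     static void mm_ikj(void) {
         for (int i = 0; i < N; i++)
             for (int k = 0; k < N; k++)
                 for (int j = 0; j < N; j++)
                     C[i][j] += A[i][k] * B[k][j];
     }

     int main(void) {
         /* Arbitrary initialization; the values do not affect the comparison. */
         for (int i = 0; i < N; i++)
             for (int j = 0; j < N; j++) { A[i][j] = 1.0; B[i][j] = 2.0; C[i][j] = 0.0; }
         printf("i-j-k: %.2fs\n", run(mm_ijk));
         printf("i-k-j: %.2fs\n", run(mm_ikj));
         return 0;
     }

     On a typical machine the i-k-j version runs noticeably faster for large N, purely because of the access order.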

  7. How to Automate? The most challenging part! The same optimization does not work for:

     for (i=0; i<N; i++)
       for (j=0; j<N; j++)
         for (k=0; k<N; k++) {
           C1[i][j] += A1[i][k] * B1[k][j];
           C2[i][j] += A2[i][k] * B2[k][j];
           C3[i][j] += A3[i][k] * B3[k][j];
           C4[i][j] += A4[i][k] * B4[k][j];
         }

     Why?

  8. It's Not Just Transformations. There are many, many reasoning steps: what to apply? how to apply? when to apply? what is its impact? And the quality of the analysis matters: how long does it take? can it potentially degrade performance? does it have provable properties (completeness, etc.)? Compiler research is all about coming up with techniques/abstractions/representations that allow the compiler to perform deep analysis.

  9. What I Cover Today. Some techniques to reason about programs for parallelism and locality: data dependences. Some transformations: tiling, tiling, tiling. Something special: Simplifying Reductions.

  10. Outline: Introduction; Loop Parallelism and Dependences; Dependence Tests; Polyhedral Model; Locality and Tiling.

  11. Loop Parallelism. A simple transformation:

     for (i=0; i<N; i++) S;   =>   forall (i=0; i<N; i++) S;

     Not so simple to reason about: legality, performance impacts, and more complicated cases where the loops must first be transformed to expose parallelism.
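
     In practice the forall is often realized with a parallel-loop annotation. A minimal sketch using OpenMP (my choice of a concrete example, not from the slides; compile with -fopenmp):

     void scale(int N, double *A, double c) {
         /* Legal as a parallel loop: iteration i touches only A[i],
            so no iteration depends on another. */
         #pragma omp parallel for
         for (int i = 0; i < N; i++)
             A[i] = A[i] * c;
     }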

  12. Analyzing Legality. Is there a parallel loop?

     for (i=0; i<N; i++)
       for (j=1; j<M; j++) {
         A[i][j] = A[i][j-1] + B[i][j];
       }

  13. Extended Graphs. Completely unroll the loops:

     for (i=0; i<5; i++)
       for (j=1; j<4; j++) {
         A[i][j] = A[i][j-1] + B[i][j];
       }

     A[0][1] = A[0][0] + B[0][1];
     A[0][2] = A[0][1] + B[0][2];
     A[0][3] = A[0][2] + B[0][3];
     A[1][1] = A[1][0] + B[1][1];
     A[1][2] = A[1][1] + B[1][2];
     A[1][3] = A[1][2] + B[1][3];
     ...

  14. Extended Graphs. Completely unroll the loops:

     for (i=0; i<N; i++)
       for (j=1; j<M; j++) {
         A[i][j] = A[i][j-1] + B[i][j];
       }

     The difficulty is program parameters: analysis is easy with a DAG representation, but there are scalability issues, and what if the parameters are not known?

  15. Iteration Spaces. We need an abstraction for statement instances.

     for (i=0; i<N; i++)
       for (j=1; j<M; j++) {
         A[i][j] = A[i][j-1] + B[i][j];
       }

     An instance is an integer vector [i,j]; the space is the integer set { [i,j] | 0 ≤ i < N and 1 ≤ j < M }.

  16. Dependences. Express relations between statements: flow (true) dependence, read-after-write (RAW); anti-dependence, write-after-read (WAR); output dependence, write-after-write (WAW); input dependence, read-after-read (RAR).
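
     A tiny C illustration of the first three kinds (my example, not from the slides; input dependences impose no ordering and are omitted):

     void deps(double *A, double *B) {
         A[0] = B[0] + 1.0;  /* S1: writes A[0]                            */
         B[1] = A[0] * 2.0;  /* S2: flow (RAW) on A[0]: S1 must run first  */
         A[0] = B[1] - 1.0;  /* S3: anti (WAR) on A[0] w.r.t. S2, and
                                output (WAW) on A[0] w.r.t. S1            */
     }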

  17. Dependence Abstractions. Distance vector: the distance between the write and the read; the reading instance is the writing instance [i,j] plus a constant vector c, e.g., [0,1]. Direction vector: only the direction of the instance that uses the value; each entry is one of <, >, ≤, ≥, =, *; e.g., [0,<]. Less precise, but sometimes sufficient.

  18. Direction Vector Example 1.

     for (i=0; i<N; i++)
       for (j=0; j<M; j++)
         A[i][j] = A[i][0] + B[i][j];

     Distance vectors: [0,1], [0,2], [0,3], ...; direction vector: [0,<].

  19. Direction Vector Example 2.

     for (i=1; i<N; i++)
       for (j=0; j<M; j++)
         A[i][j] = A[i-1][j+1] + B[i][j];

     Distance vector: [1,-1]; direction vector: [<,>].

  20. So What Do These Vectors Tell Us? Parallelism can be read off directly; the same reading works for direction vectors. Example sets of vectors for a 3-deep nest:

     [0,0,1]   [0,1,0]   [1,1,0]
     [0,0,≥1]  [0,1,≥1]  [0,1,-1]
     [1,≥0,0]  [1,≥1,0]  [1,-1,0]

     Loop-carried dependence: the loop at depth d carries a dependence if at least one distance/direction vector has its first non-zero entry at depth d. A loop that carries no dependence can run in parallel.
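
     For instance, with the single distance vector [1,0] in the sketch below (my example, not from the slides), the first non-zero entry is at depth 1, so the outer loop carries the dependence and the inner loop may run in parallel (compile with -fopenmp):

     void sweep(int N, int M, double A[N][M]) {
         for (int i = 1; i < N; i++) {   /* carries the dependence: distance [1,0] */
             #pragma omp parallel for    /* j carries nothing, so it is parallel   */
             for (int j = 0; j < M; j++)
                 A[i][j] = A[i-1][j] + 1.0;
         }
     }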

  21. Legality of Loop Permutation. Another application of distance vectors. Which ones can you permute?

     for (i=1; i<N; i++)
       for (j=1; j<M; j++)
         A[i][j] = A[i-1][j-1] + B[i][j];   // [1,1]

     for (i=1; i<N; i++)
       for (j=1; j<M; j++)
         A[i][j] = A[i][j-1] + B[i][j];     // [0,1]

     for (i=1; i<N; i++)
       for (j=1; j<M; j++)
         A[i][j] = A[i-1][j+1] + B[i][j];   // [1,-1]

  22. Legality of Loop Permutation. The [1,1] case:

     for (i=1; i<N; i++)
       for (j=1; j<M; j++)
         A[i][j] = A[i-1][j-1] + B[i][j];   // [1,1]

     Swapping the loops leaves the distance vector at [1,1], still lexicographically positive, so the permutation is legal.

  23. Legality of Loop Permutation. The [0,1] case:

     for (i=1; i<N; i++)
       for (j=1; j<M; j++)
         A[i][j] = A[i+1][j-1] + B[i][j];

     for (i=1; i<N; i++)
       for (j=1; j<M; j++)
         A[i][j] = A[i][j-1] + B[i][j];     // [0,1]

     Swapping the loops of the [0,1] case turns the distance vector into [1,0], still lexicographically positive, so the permutation is legal.

  24. Legality of Loop Permutation. The [1,-1] case:

     for (i=1; i<N; i++)
       for (j=1; j<M; j++)
         A[i][j] = A[i-1][j+1] + B[i][j];   // [1,-1]

     Swapping the loops would turn [1,-1] into [-1,1], which is lexicographically negative, so this permutation is illegal. Fully permutable loops are those whose distance vectors are non-negative in every entry, [≥0, ..., ≥0]; any permutation of such loops is legal.
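
     As a concrete check of the [1,1] case, here is the interchange written out (a sketch, not from the slides); it is legal precisely because the distance vector stays lexicographically positive after the swap:

     void original(int N, int M, double A[N][M], double B[N][M]) {
         /* Distance vector [1,1]: iteration (i,j) uses A[i-1][j-1]. */
         for (int i = 1; i < N; i++)
             for (int j = 1; j < M; j++)
                 A[i][j] = A[i-1][j-1] + B[i][j];
     }

     void interchanged(int N, int M, double A[N][M], double B[N][M]) {
         /* In the new (j,i) order the distance is still [1,1], so every
            value is produced before it is consumed: the swap is legal. */
         for (int j = 1; j < M; j++)
             for (int i = 1; i < N; i++)
                 A[i][j] = A[i-1][j-1] + B[i][j];
     }

     Doing the same to the [1,-1] case would flip the vector to [-1,1] and break the producer-consumer order.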

  25. Outline: Introduction; Loop Parallelism and Dependences; Dependence Tests; Polyhedral Model; Locality and Tiling.

  26. How to Find the Vectors. Easy case:

     for (i=1; i<N; i++)
       for (j=0; j<M; j++)
         A[i][j] = A[i-1][j+1] + B[i][j];

     Not so easy:

     for (i=1; i<N; i++)
       for (j=0; j<M; j++)
         A[i] = A[i] + B[i][j];

     for (i=1; i<N; i++)
       for (j=0; j<M; j++)
         A[i] = A[2*i-j+3] + B[i][j];

  27. How to Find the Vectors. Really difficult:

     for (i=1; i<N; i++)
       for (j=0; j<M; j++) {
         A[i*i+j*j-i*j] = A[i] + B[i][j];
         A[i*j*j-i*j*3] = A[i] + B[i][j];
       }

     There is no general solution: the polynomial case is undecidable. Tests can work for linear accesses, with a wide range of precision even in the linear case.

  28. Dependence Tests: Basic Idea. Given two accesses f(i,j) and g(x,y), the two accesses are in conflict if they touch the same location, f(i,j) = g(x,y), and at least one of them is a write. Let f and g be affine: a0 + a1*i + a2*j = b0 + b1*x + b2*y is a linear Diophantine equation, and an integer solution exists if and only if gcd(a1, a2, b1, b2) divides |a0 - b0|.
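
     A minimal C sketch of this check (the helper names and the encoding of the coefficients are mine):

     #include <stdio.h>
     #include <stdlib.h>

     static long gcd(long a, long b) {
         a = labs(a); b = labs(b);
         while (b != 0) { long t = a % b; a = b; b = t; }
         return a;
     }

     /* GCD test for a0 + a1*i + a2*j = b0 + b1*x + b2*y.
        Returns 1 if an integer solution may exist (possible dependence),
        0 if the test proves independence. */
     static int gcd_test(long a0, long a1, long a2,
                         long b0, long b1, long b2) {
         long g = gcd(gcd(a1, a2), gcd(b1, b2));
         long c = labs(a0 - b0);
         if (g == 0)              /* all coefficients zero */
             return c == 0;
         return c % g == 0;
     }

     int main(void) {
         /* A[3*i] vs A[6*x-3*y+2]: gcd 3 does not divide 2 -> independent */
         printf("%d\n", gcd_test(0, 3, 0, 2, 6, -3));
         /* A[2*i] vs A[4*x-2*y+2]: gcd 2 divides 2 -> possible dependence */
         printf("%d\n", gcd_test(0, 2, 0, 2, 4, -2));
         return 0;
     }

     The two calls in main correspond to the examples on the next slide.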

  29. GCD Test.

     for (i=1; i<N; i++)
       for (j=0; j<M; j++)
         A[3*i] = A[6*i-3*j+2] + B[i][j];

     3i = 6x - 3y + 2, i.e., 3i - 6x + 3y = 2; gcd(3,6,3) = 3, which does not divide 2, so there is no dependence.

     for (i=1; i<N; i++)
       for (j=0; j<M; j++)
         A[2*i] = A[4*i-2*j+2] + B[i][j];

     2i = 4x - 2y + 2, i.e., 2i - 4x + 2y = 2; gcd(2,4,2) = 2, which divides 2, so a dependence is possible.

  30. Why is the GCD Test Inexact? When does the GCD test give a false positive? What happens when the GCD is 1?

     for (i=0; i<N; i++)
       for (j=N; j<M; j++)
         A[i] = A[j] + B[i][j];

     GCD test: i = j has a trivial solution, so a dependence is reported. The main problem: the space is completely unconstrained; the test ignores the loop bounds, which here force i < N ≤ j.

  31. Banerjee Test [Banerjee 1976]. Making it slightly better: there may be a dependence only if min(f(i,j) - g(x,y)) ≤ 0 and 0 ≤ max(f(i,j) - g(x,y)), where min and max are taken over the loop bounds.

     for (i=0; i<N; i++)
       for (j=N; j<M; j++) {
         A[i] = A[j] + B[i][j];
       }

     min(i-j) = 0 - (M-1) = 1 - M
     max(i-j) = (N-1) - N = -1

     Since max(i-j) < 0, the equation i = j has no solution within the bounds: no dependence.
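
     For this example the extrema are easy to compute by hand; a sketch of the interval check in C (my helper, assuming the bounds 0 ≤ i ≤ N-1 and N ≤ j ≤ M-1 are known):

     /* Banerjee-style check for the difference i - j. */
     static int may_depend(long N, long M) {
         long min_diff = 0 - (M - 1);   /* smallest i minus largest j */
         long max_diff = (N - 1) - N;   /* largest i minus smallest j */
         /* A dependence is possible only if 0 lies in [min_diff, max_diff]. */
         return min_diff <= 0 && 0 <= max_diff;
     }

     Here max_diff = -1 < 0, so the test proves independence where the GCD test could not.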

  32. Dependence Tests. A method for detecting parallelism, (probably) used by many production compilers: important for vectorization, but not so good for thread-level parallelism. How can we expose parallelism?

  33. Outline: Introduction; Loop Parallelism and Dependences; Dependence Tests; Polyhedral Model; Locality and Tiling.

  34. Exact Method: Polyhedral Model. Array Dataflow Analysis [Feautrier 1991]: given read and write statement instances r and w, find w as a function of r such that r and w are in conflict, w happens before r, and w is the most recent such write, when everything is affine. Main engine: Parametric Integer Linear Programming.

  35. Exact Dependence Analysis. Who produced the value read at A[j]?

     for (i=0; i<N; i++)
       for (j=0; j<M; j++)
 S:        A[i] = A[j] + B[i][j];

     The source of the value read by S<i,j>:
       if i>j:            S<j,M-1>
       if i=j and i>0:    S<j,j-1>
       if j>i or i=j=0:   A[j] (the initial value)

     Powerful but expensive.

  36. Alternative View of the ADA. It performs full array expansion; arrays are a source of complications. Recall:

     for (i=1; i<N; i++)
       for (j=0; j<M; j++)
         A[i][j] = A[i-1][j+1] + B[i][j];

     Dynamic Single Assignment simplifies analysis, and ADA converts the program to such a view. The trade-off: memory vs. parallelism.

  37. Polyhedral Representation. A high-level abstraction of the program: statement instances are integer polyhedra; dependences are affine functions. The usual optimization flow: 1. extract the polyhedral representation; 2. reason about and transform the model; 3. generate code at the end.

  38. Statement Instances. Statements are executed at different iterations of the loop. Example:

     for (i=1; i<N; i++)
       for (j=1; j<=i; j++)
         S0;

     There are about N²/2 instances of S0. An instance is denoted S0<i,j>, and the set is represented as the polyhedron { [i,j] | 1 ≤ i < N, 1 ≤ j ≤ i }, which also has a geometric view as a triangle of integer points.

  39. Dependence Functions. Expressed as affine functions of the space. Example: a 1D stencil, where each point (t,i) depends on its three neighbors from the previous time step: (t,i -> t-1,i-1), (t,i -> t-1,i), (t,i -> t-1,i+1). The functions can be domain-qualified: e.g., at the boundary i=0, the (t,i -> t-1,i-1) dependence does not apply.

  40. Loop Transformations. Also affine functions: loop permutation (i,j -> j,i); loop skewing (i,j -> i,i+j).

     for (i=0; i<N; i++)
       for (j=0; j<M; j++)
         S;

     becomes, after skewing:

     for (i=0; i<N; i++)
       for (j=i; j<M+i; j++)
         S';

     Affine loops + affine transformations permit linear programming: one of the few successes in parallelization.
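
     A sketch of skewing applied to a concrete statement (my example, not from the slides): under the map (i,j -> i,i+j) the new inner index jj equals i+j, so the body recovers j as jj-i.

     void original_loop(int N, int M, double A[N][M+1]) {
         for (int i = 0; i < N; i++)
             for (int j = 0; j < M; j++)
                 A[i][j+1] = A[i][j] + 1.0;
     }

     void skewed_loop(int N, int M, double A[N][M+1]) {
         /* Same statement instances, executed in the skewed order;
            the body substitutes j = jj - i. */
         for (int i = 0; i < N; i++)
             for (int jj = i; jj < M + i; jj++)
                 A[i][jj - i + 1] = A[i][jj - i] + 1.0;
     }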

  41. Composing Transformations. The key strength of the framework: instead of applying transformations T1, T2, ... one after another on loops in the "loop world", the program is abstracted once into the polyhedral representation, the composed transformation is applied there, and loops are generated back only at the end.

  42. Polyhedral Loop Transformations. Everything is affine: the iteration space, the dependences, the schedule (transformations), and the memory mapping. The set of possible transformations can be explored by ILP: find a schedule that exposes parallelism. But what is a good objective function?

  43. Outline: Introduction; Loop Parallelism and Dependences; Dependence Tests; Polyhedral Model; Locality and Tiling.

  44. Data Locality. 1D Jacobi stencil:

     for t=0; t<T; t++
       for i=1; i<N-1; i++
         A[i] = f(B[i], B[i-1], B[i+1])
       //swap A and B

     Poor data locality: once the arrays exceed the data that fit in cache, the values touched at time step t=0 are evicted before time step t=1 can reuse them.

  45. Tiling. A well-known transformation in HPC [Wolfe 87].

     for t=0; t<T; t++
       for i=1; i<N-1; i++
         A[i] = f(B[i], B[i-1], B[i+1])
       //swap A and B

     Tiling improves data locality, and with it performance: on my laptop, 5.95s down to 4.76s (20%); PLuTo achieves ~1.8x (with an older processor).

  46. So What is Tiling? A loop transformation whose effect is to change the order of execution:

     for t=0; t<T; t++
       for i=1; i<N-1; i++
         foo()

     becomes

     for tt=0; tt<T; tt+=x
       for ti=1; ti<N-1; ti+=y
         for t=tt; t<min(tt+x,T); t++
           for i=ti; i<min(ti+y,N-1); i++
             foo()
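
     As a concrete, legal instance of this pattern, here is the classic tiled matrix multiplication (my sketch, not from the slides; the tile size TS is an illustrative choice). All three loops here are freely permutable, so rectangular tiling is legal as-is; the Jacobi stencil above, by contrast, can only be tiled after a skewing step, because its dependences include a negative entry.

     #define TS 32  /* tile size: illustrative */

     static int imin(int a, int b) { return a < b ? a : b; }

     void matmul_tiled(int N, double A[N][N], double B[N][N], double C[N][N]) {
         /* Tile loops: enumerate tile origins. */
         for (int ii = 0; ii < N; ii += TS)
             for (int kk = 0; kk < N; kk += TS)
                 for (int jj = 0; jj < N; jj += TS)
                     /* Point loops: execute one tile as an atomic unit. */
                     for (int i = ii; i < imin(ii + TS, N); i++)
                         for (int k = kk; k < imin(kk + TS, N); k++)
                             for (int j = jj; j < imin(jj + TS, N); j++)
                                 C[i][j] += A[i][k] * B[k][j];
     }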

  47. Visualization of Tiling. Tiling improves locality by exploiting temporal reuse within a tile; each tile becomes an atomic unit of execution.

  48. What is Tiling? A loop transformation that doubles the depth of the nest.

     for (t=0; t<N; t++)
       for (i=0; i<M; i++)
         S;

     The tile loops enumerate the tile origins (here with tile size 3):

     for (x=0; x<N; x+=3)
       for (y=0; y<M; y+=3)
         S

  49. What is Tiling? The point loops execute the iterations within one tile:

     for (t=x; t<x+3; t++)
       for (i=y; i<y+3; i++)
         S;

  50. What is Tiling? The point loops must be clipped at the boundary of the iteration space (here i < M):

     for (t=x; t<x+3; t++)
       for (i=y; i<min(y+3,M); i++)
         S;
