Compiler Optimizations: Introduction to Loop Transformations

Dive into the world of compiler optimizations, with a focus on loop transformations. Explore the significance of high-level optimizations, parallelism, and data locality in enhancing program efficiency. Understand the impact of loop transformations on performance, and learn how the optimization process can be automated.

  • Compiler Optimizations
  • Loop Transformations
  • High-Level Optimizations
  • Parallelism
  • Data Locality


Presentation Transcript


  1. Research in Compilers and Introduction to Loop Transformations, Part II: Compiler Optimizations. Tomofumi Yuki, EJCP 2016, June 29, Lille.

  2. Compiler Optimizations

     Low-level optimizations: register allocation, instruction scheduling, constant propagation, ...

     High-level optimizations (the focus for today): loop transformations, coarse-grained parallelism, ...

  3. High-Level Optimizations

     Goals: parallelism and data locality. Why parallelism? Why data locality? Why high-level?

  4. Why Loop Transformations?

     The 90/10 rule: 90% of the execution time is spent in less than 10% of the source code. Loop nests are the hotspot of almost all programs; a few lines of change can have a huge impact, and loops are a natural source of parallelism.

  5. Why Loop Transformations? Which is faster?

         for (i=0; i<N; i++)
           for (j=0; j<N; j++)
             for (k=0; k<N; k++)
               C[i][j] += A[i][k] * B[k][j];

         for (i=0; i<N; i++)
           for (k=0; k<N; k++)
             for (j=0; j<N; j++)
               C[i][j] += A[i][k] * B[k][j];

  6. Why is it Faster? Hardware prefetching.

     In the i-j-k order, the innermost loop leaves C[i][j] unchanged, walks A[i][k] to the next column (unit stride), and walks B[k][j] to the next row (stride N):

         for (i=0; i<N; i++)
           for (j=0; j<N; j++)
             for (k=0; k<N; k++)
               C[i][j] += A[i][k] * B[k][j];

     In the i-k-j order, A[i][k] is unchanged in the innermost loop while C[i][j] and B[k][j] both step to the next column, so every moving access is unit-stride and prefetch-friendly:

         for (i=0; i<N; i++)
           for (k=0; k<N; k++)
             for (j=0; j<N; j++)
               C[i][j] += A[i][k] * B[k][j];
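     To make the comparison concrete, here is a minimal benchmark sketch; the value of N, the initialization values, and the clock()-based harness are assumptions, not from the slides.

         #include <stdio.h>
         #include <time.h>

         #define N 512
         static double A[N][N], B[N][N], C[N][N];

         /* i-j-k: B is accessed column-wise (stride N), defeating
          * spatial locality and hardware prefetching. */
         static void matmul_ijk(void) {
             for (int i = 0; i < N; i++)
                 for (int j = 0; j < N; j++)
                     for (int k = 0; k < N; k++)
                         C[i][j] += A[i][k] * B[k][j];
         }

         /* i-k-j: the innermost loop walks C and B row-wise (stride 1). */
         static void matmul_ikj(void) {
             for (int i = 0; i < N; i++)
                 for (int k = 0; k < N; k++)
                     for (int j = 0; j < N; j++)
                         C[i][j] += A[i][k] * B[k][j];
         }

         int main(void) {
             for (int i = 0; i < N; i++)
                 for (int j = 0; j < N; j++) {
                     A[i][j] = 1.0; B[i][j] = 2.0; C[i][j] = 0.0;
                 }
             clock_t t0 = clock();
             matmul_ijk();
             clock_t t1 = clock();
             matmul_ikj();
             clock_t t2 = clock();
             printf("ijk: %.2fs  ikj: %.2fs\n",
                    (double)(t1 - t0) / CLOCKS_PER_SEC,
                    (double)(t2 - t1) / CLOCKS_PER_SEC);
             return 0;
         }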

  7. How to Automate? The most challenging part! The same optimization doesn't work for:

         for (i=0; i<N; i++)
           for (j=0; j<N; j++)
             for (k=0; k<N; k++) {
               C1[i][j] += A1[i][k] * B1[k][j];
               C2[i][j] += A2[i][k] * B2[k][j];
               C3[i][j] += A3[i][k] * B3[k][j];
               C4[i][j] += A4[i][k] * B4[k][j];
             }

     Why?

  8. It's Not Just Transformations

     There are many, many reasoning steps: what to apply? how to apply? when to apply? what is its impact? And the quality of the analysis matters: how long does it take? can it potentially degrade performance? does it have provable properties (completeness, etc.)? Compiler research is all about coming up with techniques, abstractions, and representations that allow the compiler to perform deep analysis.

  9. What I Cover Today

     Some techniques to reason about programs for parallelism and locality: data dependences. Some transformations: tiling, tiling, tiling. Something special: simplifying reductions.

  10. Outline: Introduction; Loop Parallelism and Dependences; Dependence Tests; Polyhedral Model; Locality and Tiling.

  11. Loop Parallelism

     A simple transformation:

         for (i=0; i<N; i++) S;      =>      forall (i=0; i<N; i++) S;

     But it is not so simple to reason about: legality, performance impacts, and more complicated cases where the loops must first be transformed to expose parallelism.
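     In practice, one common way to realize forall is OpenMP's parallel-for. A minimal sketch, assuming a body with no loop-carried dependence; the function and its body are illustrative.

         /* Compile with -fopenmp; the pragma is ignored otherwise. */
         void scale(double *A, double c, int N) {
             #pragma omp parallel for   /* forall (i=0; i<N; i++) */
             for (int i = 0; i < N; i++)
                 A[i] = c * A[i];       /* S: independent across i */
         }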

  12. Analyzing Legality: Is there a parallel loop?

         for (i=0; i<N; i++)
           for (j=1; j<M; j++) {
             A[i][j] = A[i][j-1] + B[i][j];
           }
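     For reference, the dependence here (A[i][j-1] feeding A[i][j]) runs only along j, so the i loop carries no dependence; the next slides develop the reasoning. A sketch of the resulting legal parallelization, assuming an OpenMP annotation:

         /* Each row i is independent: iteration (i,j) reads only
          * A[i][j-1], written earlier within the same i. */
         void compute(int N, int M, double A[N][M], double B[N][M]) {
             #pragma omp parallel for
             for (int i = 0; i < N; i++)
                 for (int j = 1; j < M; j++)  /* j carries the dependence */
                     A[i][j] = A[i][j-1] + B[i][j];
         }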

  13. Extended Graphs: Completely unroll the loops

         for (i=0; i<5; i++)
           for (j=1; j<4; j++) {
             A[i][j] = A[i][j-1] + B[i][j];
           }

         A[0][1] = A[0][0] + B[0][1];
         A[0][2] = A[0][1] + B[0][2];
         A[0][3] = A[0][2] + B[0][3];
         A[1][1] = A[1][0] + B[1][1];
         A[1][2] = A[1][1] + B[1][2];
         A[1][3] = A[1][2] + B[1][3];
         ...

  14. Extended Graphs: Completely unroll the loops

         for (i=0; i<N; i++)
           for (j=1; j<M; j++) {
             A[i][j] = A[i][j-1] + B[i][j];
           }

     The difficulty: program parameters. The analysis is easy with a DAG representation, but it has scalability issues, and what if the parameters are not known?

  15. Iteration Spaces: We need an abstraction for statement instances

         for (i=0; i<N; i++)
           for (j=1; j<M; j++) {
             A[i][j] = A[i][j-1] + B[i][j];
           }

     An instance is an integer vector [i,j]; the space is an integer set: 0 <= i < N and 1 <= j < M. (The slide shows the corresponding rectangle in the i-j plane.)

  16. Dependences: Express relations between statements

       • flow (true) dependence: RAW (read after write)
       • anti-dependence: WAR (write after read)
       • output dependence: WAW (write after write)
       • input dependence: RAR (read after read)

  17. Dependence Abstractions

     Distance vector: the distance between the write and the read, [i,j] + c, e.g., [0,1]. Direction vector: the direction of the instance that uses the value; each entry is one of <, >, <=, >=, =, *; e.g., [0,<]. Less precise, but sometimes sufficient.

  18. Direction Vector Example 1

         for (i=0; i<N; i++)
           for (j=0; j<M; j++)
             A[i][j] = A[i][0] + B[i][j];

     Distance vectors: [0,1], [0,2], [0,3], ...; direction vector: [0,<].

  19. Direction Vector Example 2

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[i][j] = A[i-1][j+1] + B[i][j];

     Distance vector: [1,-1]; direction vector: [<,>] (the value written at [i-1,j+1] is read at [i,j]).

  20. So What Do These Vectors Tell Us?

     Parallelism becomes clear. Given a set of distance vectors such as {[0,0,1], [0,1,0], [1,1,0]}, {[0,0,1], [0,1,1], [0,1,-1]}, or {[1,0,0], [1,1,0], [1,-1,0]}, we can read off which loops are parallel; the same works for direction vectors. Loop-carried dependence: the loop at depth d carries a dependence if at least one of the distance/direction vectors has its leading non-zero entry at position d. For example, in {[1,0,0], [1,1,0], [1,-1,0]} the outermost loop carries every dependence, so the inner two loops are parallel. (A small executable version of this rule is sketched below.)
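     A minimal C sketch of that rule; the fixed depth, array layout, and function names are assumptions made for illustration.

         #include <stdbool.h>

         #define DEPTH 3  /* matches the 3-deep vectors above */

         /* The loop at depth d (0-based) carries `dist` iff all entries
          * before d are zero and the entry at d is non-zero. */
         bool carries(const int dist[DEPTH], int d) {
             for (int k = 0; k < d; k++)
                 if (dist[k] != 0) return false;  /* carried further out */
             return dist[d] != 0;
         }

         /* Loop d is parallel iff no distance vector is carried at d. */
         bool loop_is_parallel(const int dists[][DEPTH], int m, int d) {
             for (int v = 0; v < m; v++)
                 if (carries(dists[v], d)) return false;
             return true;
         }

     For {[1,0,0], [1,1,0], [1,-1,0]}, loop_is_parallel reports depth 0 as sequential and depths 1 and 2 as parallel, matching the reading above.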

  21. Legality of Loop Permutation: another application of distance vectors. Which ones can you permute?

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i-1][j-1] + B[i][j];   // [1,1]

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i][j-1] + B[i][j];     // [0,1]

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i-1][j+1] + B[i][j];   // [1,-1]

  22. Legality of Loop Permutation (continued)

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i-1][j-1] + B[i][j];   // [1,1]

     Interchanging i and j turns [1,1] into [1,1], which is still lexicographically positive: the interchange is legal.

  23. Legality of Loop Permutation (continued)

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i+1][j-1] + B[i][j];

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i][j-1] + B[i][j];     // [0,1]

     Interchanging turns [0,1] into [1,0], still lexicographically positive: legal.

  24. Legality of Loop Permutation (continued)

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i-1][j+1] + B[i][j];   // [1,-1]

     Interchanging turns [1,-1] into [-1,1], which is lexicographically negative: the interchange is illegal. Fully permutable nest: every distance vector is non-negative in every entry, [>=0, ..., >=0], so any permutation is legal.
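     The check applied by eye on these three slides can be written down directly; a sketch with illustrative function names.

         #include <stdbool.h>

         /* Lexicographic positivity: the first non-zero entry is positive. */
         bool lex_positive(const int *v, int n) {
             for (int k = 0; k < n; k++) {
                 if (v[k] > 0) return true;
                 if (v[k] < 0) return false;
             }
             return false;  /* all-zero is not a real dependence */
         }

         /* 2D interchange swaps each [a,b] into [b,a]; it is legal iff
          * every swapped vector is still lexicographically positive. */
         bool interchange_legal(const int dists[][2], int m) {
             for (int v = 0; v < m; v++) {
                 int swapped[2] = { dists[v][1], dists[v][0] };
                 if (!lex_positive(swapped, 2)) return false;
             }
             return true;
         }

     On the three examples: [1,1] -> [1,1] (legal), [0,1] -> [1,0] (legal), [1,-1] -> [-1,1] (illegal).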

  25. Outline: Introduction; Loop Parallelism and Dependences; Dependence Tests; Polyhedral Model; Locality and Tiling.

  26. How to Find the Vectors

     Easy case:

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[i][j] = A[i-1][j+1] + B[i][j];

     Not too easy:

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[i] = A[i] + B[i][j];

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[i] = A[2*i-j+3] + B[i][j];

  27. How to Find the Vectors

     Really difficult:

         for (i=1; i<N; i++)
           for (j=0; j<M; j++) {
             A[i*i+j*j-i*j] = A[i] + B[i][j];
             A[i*j*j-i*j*3] = A[i] + B[i][j];
           }

     There is no general solution: the polynomial case is undecidable. Tests can work for linear accesses, but with a wide range of precision even in the linear case.

  28. Dependence Tests: Basic Idea

     Given two accesses f(i,j) and g(x,y), the accesses are in conflict if they touch the same location, f(i,j) = g(x,y), and at least one of them is a write. Let f and g be affine: a0 + a1*i + a2*j = b0 + b1*x + b2*y is a linear Diophantine equation, and an integer solution exists iff gcd(a1, a2, b1, b2) divides b0 - a0.

  29. GCD Test

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[3*i] = A[6*i-3*j+2] + B[i][j];

     3i = 6x - 3y + 2, i.e., 3i - 6x + 3y = 2. Does gcd(3,6,3) = 3 divide 2? No: the accesses are independent.

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[2*i] = A[4*i-2*j+2] + B[i][j];

     2i = 4x - 2y + 2, i.e., 2i - 4x + 2y = 2. Does gcd(2,4,2) = 2 divide 2? Yes: there may be a dependence.
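     The test is small enough to write out; a sketch with assumed helper names that reproduces both examples.

         #include <stdbool.h>
         #include <stdio.h>

         int gcd(int a, int b) {
             if (a < 0) a = -a;
             if (b < 0) b = -b;
             while (b != 0) { int t = a % b; a = b; b = t; }
             return a;
         }

         /* Equation c1*i + c2*x + c3*y = rhs has an integer solution
          * iff gcd(c1,c2,c3) divides rhs. */
         bool gcd_test(int c1, int c2, int c3, int rhs) {
             return rhs % gcd(gcd(c1, c2), c3) == 0;
         }

         int main(void) {
             printf("%d\n", gcd_test(3, -6, 3, 2));  /* 0: independent */
             printf("%d\n", gcd_test(2, -4, 2, 2));  /* 1: may depend  */
             return 0;
         }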

  30. Why is the GCD Test Inexact?

     When does the GCD test give a false positive? What happens when the gcd is 1?

         for (i=0; i<N; i++)
           for (j=N; j<M; j++)
             A[i] = A[j] + B[i][j];

     The GCD test looks at i = j, which has a trivial solution, so it reports a possible dependence; yet i < N <= j at every iteration, so the accesses never conflict. The main problem: the space is completely unconstrained; the test ignores the loop bounds.

  31. Banerjee Test [Banerjee 1976]

     Making it slightly better: there may be a dependence only if min(f(i,j) - g(x,y)) <= 0 and 0 <= max(f(i,j) - g(x,y)).

         for (i=0; i<N; i++)
           for (j=N; j<M; j++) {
             A[i] = A[j] + B[i][j];
           }

     Here min(i-j) = 0 - (M-1) = 1-M and max(i-j) = (N-1) - N = -1 < 0, so there is no dependence.
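     The same bounds, hard-coded for this example; a sketch only, since the real test derives the extrema symbolically from the loop bounds (and this one assumes 0 < N <= M so both loops are non-empty).

         #include <stdbool.h>

         bool banerjee_may_depend(int N, int M) {
             /* i ranges over [0, N-1] and j over [N, M-1], so: */
             int min_diff = 0 - (M - 1);     /* = 1 - M               */
             int max_diff = (N - 1) - N;     /* = -1: i is always < j */
             return min_diff <= 0 && 0 <= max_diff;  /* false here */
         }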

  32. Outline: Introduction; Loop Parallelism and Dependences; Dependence Tests; Polyhedral Model; Locality and Tiling.

  33. Exact Method: Polyhedral Model

     Array Dataflow Analysis [Feautrier 1991]: given read and write statement instances r and w, find w as a function of r such that r and w are in conflict, w happens before r, and w is the most recent such write. This is computable when everything is affine. Main engine: parametric integer linear programming.

  34. Exact Dependence Analysis

     Who produced the value read at A[j]?

         for (i=0; i<N; i++)
           for (j=0; j<M; j++)
     S:      A[i] = A[j] + B[i][j];

     The source of the value read by S<i,j> is:
       • if i > j and j > 0: S<j, M-1>
       • if i = j and i > 0: S<j, j-1>
       • if j > i or i = j = 0: the initial value of A[j]

     Powerful, but expensive.

  35. Alternative View of the ADA

     ADA performs full array expansion; arrays are a source of complications. Recall:

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[i][j] = A[i-1][j+1] + B[i][j];

     Dynamic Single Assignment form simplifies analysis, and ADA converts the program to such a view. The trade-off: memory vs. parallelism.

  36. Polyhedral Representation

     A high-level abstraction of the program: statement instances are integer polyhedra, and dependences are affine functions. The usual optimization flow:
       1. extract the polyhedral representation
       2. reason about and transform the model
       3. generate code at the end

  37. Statement Instances

     Statements execute at different iterations of the loops. Example:

         for (i=1; i<N; i++)
           for (j=1; j<=i; j++)
             S0;

     There are roughly N^2/2 instances of S0, denoted S0<i,j> and represented as the polyhedron {[i,j] | 1 <= i < N, 1 <= j <= i}, which has a geometric (triangular) view in the i-j plane.
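     A small sketch that enumerates exactly these instances; the printing and counting are illustrative.

         #include <stdio.h>

         void enumerate(int N) {
             int count = 0;
             for (int i = 1; i < N; i++)
                 for (int j = 1; j <= i; j++) {  /* triangular domain */
                     printf("S0<%d,%d>\n", i, j);
                     count++;
                 }
             printf("%d instances\n", count);    /* N(N-1)/2, ~N^2/2 */
         }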

  38. Dependence Functions

     Dependences are expressed as affine functions on the space. Example, a 1D stencil over (t,i):
       (t,i -> t-1,i-1), (t,i -> t-1,i), (t,i -> t-1,i+1)
     They can be domain-qualified, e.g., if i=0: (t,i -> t-1,i-1) and (t,i -> t-1,i).

  39. Loop Transformations

     Transformations are also affine functions: loop permutation is (i,j -> j,i); loop skewing is (i,j -> i,i+j).

         for (i=0; i<N; i++)                 for (i=0; i<N; i++)
           for (j=0; j<M; j++)        =>       for (j=i; j<M+i; j++)
             S;                                  S';

     Affine loops plus affine transformations permit linear programming: one of the few success stories in automatic parallelization.
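     A hand-applied sketch of the skewing (i,j -> i,i+j); the statement body is an illustrative assumption, and S' recovers the original j as j - i.

         void original(int N, int M, double A[N][M]) {
             for (int i = 0; i < N; i++)
                 for (int j = 0; j < M; j++)
                     A[i][j] = i + j;             /* S  */
         }

         void skewed(int N, int M, double A[N][M]) {
             for (int i = 0; i < N; i++)
                 for (int j = i; j < M + i; j++)  /* skewed bounds  */
                     A[i][j - i] = i + (j - i);   /* S': same instances */
         }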

  40. Composing Transformations

     A key strength of the framework: instead of rewriting the loop nest after each transformation, the program is lifted once into the polyhedral abstraction, the transformations T1, T2, ... are composed there, and loops are regenerated once at the end. (The slide shows this as a diagram: the loop world on top, the polyhedral world below.)

  41. Outline: Introduction; Loop Parallelism and Dependences; Dependence Tests; Polyhedral Model; Locality and Tiling.

  42. Data Locality

     1D Jacobi stencil:

         for (t=0; t<T; t++)
           for (i=1; i<N-1; i++)
             A[i] = f(B[i], B[i-1], B[i+1]);
         // swap A and B after each time step

     Poor data locality: each time step sweeps the whole array, so by the time a value is reused in the next step it has been evicted. (The slide illustrates the t=0 and t=1 sweeps against the window of data that fit in cache.)
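     A compilable version of the slide's pseudo-code; the 3-point average used for f is an assumption, since the slides leave f abstract.

         /* Returns the buffer holding the final values (which one
          * depends on the parity of T). Boundary cells stay untouched,
          * as on the slide. */
         double *jacobi1d(int T, int N, double *A, double *B) {
             for (int t = 0; t < T; t++) {
                 for (int i = 1; i < N - 1; i++)
                     A[i] = (B[i] + B[i-1] + B[i+1]) / 3.0;  /* f(...) */
                 double *tmp = A; A = B; B = tmp;   /* swap A and B */
             }
             return B;  /* most recently written buffer */
         }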

  43. Tiling

     A well-known transformation in HPC [Wolfe 87]:

         for (t=0; t<T; t++)
           for (i=1; i<N-1; i++)
             A[i] = f(B[i], B[i-1], B[i+1]);
         // swap A and B

     Tiling improves data locality, and the performance improvement is real: on my laptop, 5.95s -> 4.76s (20%); PLuTo reports ~1.8x (with an older processor).

  44. So What is Tiling?

     A loop transformation whose effect is to change the order of execution:

         for (t=0; t<T; t++)
           for (i=1; i<N-1; i++)
             foo();

     becomes

         for (tt=0; tt<T; tt+=x)
           for (ti=1; ti<N-1; ti+=y)
             for (t=tt; t<min(tt+x,T); t++)
               for (i=ti; i<min(ti+y,N-1); i++)
                 foo();

  45. Visualization of Tiling

     Tiling improves locality by exploiting temporal locality: each tile becomes an atomic unit of execution. (The slide shows the t-i iteration space before and after tiling.)

  46. What is Tiling?

     A loop transformation that doubles the loop depth:

         for (t=0; t<N; t++)
           for (i=0; i<M; i++)
             S;

     Tile loops enumerate the tile origins:

         for (x=0; x<N; x+=3)
           for (y=0; y<M; y+=3)
             ...

  47. What is Tiling? (continued)

     Point loops enumerate the points within one tile:

         for (t=x; t<x+3; t++)
           for (i=y; i<y+3; i++)
             S;

  48. What is Tiling? (continued)

     When M is not a multiple of the tile size, the point-loop bound must be clamped at the border:

         for (t=x; t<x+3; t++)
           for (i=y; i<min(y+3,M); i++)
             S;

  49. What is Tiling? (continued)

     Putting tile loops and point loops together:

         for (x=0; x<N; x+=3)
           for (y=0; y<M; y+=3)
             for (t=x; t<min(x+3,N); t++)
               for (i=y; i<min(y+3,M); i++)
                 S;

  50. What is Tiling? (final form)

         for (t=0; t<N; t++)
           for (i=0; i<M; i++)
             S;

     becomes

         for (x=0; x<N; x+=3)
           for (y=0; y<M; y+=3)
             for (t=x; t<min(x+3,N); t++)
               for (i=y; i<min(y+3,M); i++)
                 S;
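     The final form, instantiated as compilable C with an illustrative body standing in for S (the slides leave S abstract):

         static inline int min(int a, int b) { return a < b ? a : b; }

         void tiled(int N, int M, double C[N][M]) {
             for (int x = 0; x < N; x += 3)                    /* tile loops  */
                 for (int y = 0; y < M; y += 3)
                     for (int t = x; t < min(x + 3, N); t++)   /* point loops */
                         for (int i = y; i < min(y + 3, M); i++)
                             C[t][i] += 1.0;                   /* S */
         }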
