Introduction to Loop Transformations in Compiler Optimizations
Delve into the world of compiler optimizations with a focus on loop transformations. Explore the significance of high-level optimizations, parallelism, and data locality. Understand the impact of loop transformations on program performance and learn why they are crucial for achieving faster execution times. Discover the challenges in automating optimizations and the intricacies of applying transformations effectively.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
Research in Compilers and Introduction to Loop Transformations Part II: Compiler Optimizations Tomofumi Yuki EJCP 2017 June 29, Toulouse
Compiler Optimizations Low-level Optimizations register allocation instruction scheduling constant propagation ... High-level Optimizations loop transformations Focus for today coarse grained parallelism ... EJCP 2017, June 29, Toulouse 2
High-Level Optimizations Goals: Parallelism and Data Locality Why Parallelism? Why Data Locality? Why High-Level? EJCP 2017, June 29, Toulouse 3
Why Loop Transformations? The 90/10 Rule 90% of the execution time is spent in less than 10% of the source code Loop Nests hotspot of almost all programs few lines of change => huge impact natural source of parallelism EJCP 2017, June 29, Toulouse 4
Why Loop Transformations? Which is faster? for (i=0; i<N; i++) for (j=0; j<N; j++) for (k=0; k<N; k++) C[i][j] += A[i][k] * B[k][j]; for (i=0; i<N; i++) for (k=0; k<N; k++) for (j=0; j<N; j++) C[i][j] += A[i][k] * B[k][j]; EJCP 2017, June 29, Toulouse 5
Why is it Faster? Hardware Prefetching for (i=0; i<N; i++) for (j=0; j<N; j++) for (k=0; k<N; k++) C[i][j] += A[i][k] * B[k][j]; unchanged next col next row for (i=0; i<N; i++) for (k=0; k<N; k++) for (j=0; j<N; j++) C[i][j] += A[i][k] * B[k][j]; next col unchanged EJCP 2017, June 29, Toulouse next col 6
How to Automate? The most challenging part! The same optimization doesn t work for: for (i=0; i<N; i++) for (j=0; j<N; j++) for (k=0; k<N; k++) { C1[i][j] += A1[i][k] * B1[k][j]; C2[i][j] += A2[i][k] * B2[k][j]; C3[i][j] += A3[i][k] * B3[k][j]; C4[i][j] += A4[i][k] * B4[k][j]; } Why? EJCP 2017, June 29, Toulouse 7
Its Not Just Transformations Many many reasoning steps: What to apply? How to apply? Compiler Research is all about coming up with techniques/abstractions/representations to allow the compiler to perform deep analysis. When to apply? What is its impact? Quality of the analysis: How long does it take? Can it potentially degrade performance? Provable properties (completeness, etc.) EJCP 2017, June 29, Toulouse 8
What I Cover Today Some techniques to reason about programs for parallelism and locality data dependences Some transformations tiling tiling tiling Some thing special: Simplifying Reductions EJCP 2017, June 29, Toulouse 9
Outline Introduction Loop Parallelism and Dependences Dependence Tests Polyhedral Model Locality and Tiling EJCP 2017, June 29, Toulouse 10
Loop Parallelism Simple transformation forall (i=0; i<N; i++) S; for (i=0; i<N; i++) S; Not so simple to reason about Legality Performance impacts More complicated cases Transform the loops to expose parallelism EJCP 2017, June 29, Toulouse 11
Analyzing Legality Is there a parallel loop? for (i=0; i<N; i++) for (j=1; j<M; j++) { A[i][j] = A[i][j-1] + B[i][j]; } EJCP 2017, June 29, Toulouse 12
Extended Graphs Completely unroll the loops for (i=0; i<5; i++) for (j=1; j<4; j++) { A[i][j] = A[i][j-1] + B[i][j]; } A[0][1] = A[0][0] + B[0][1]; A[0][2] = A[0][1] + B[0][2]; A[0][3] = A[0][2] + B[0][3]; A[1][1] = A[1][0] + B[1][1]; A[1][2] = A[1][1] + B[1][2]; A[1][3] = A[1][2] + B[1][3]; .... EJCP 2017, June 29, Toulouse 13
Extended Graphs Completely unroll the loops for (i=0; i<N; i++) for (j=1; j<M; j++) { A[i][j] = A[i][j-1] + B[i][j]; } The difficulty: program parameters its easy with DAG representation scalability issues what if parameters are not known? EJCP 2017, June 29, Toulouse 14
Iteration Spaces Need an abstraction for statement instances for (i=0; i<N; i++) for (j=1; j<M; j++) { A[i][j] = A[i][j-1] + B[i][j]; } j instance = integer vector [i,j] space = integer set 0 i<N and 1 j<M i EJCP 2017, June 29, Toulouse 15
Dependences Express relations between statements flow (true) dependence RAW anti-dependence WAR output dependence WAW input dependence RAR EJCP 2017, June 29, Toulouse 16
Dependence Abstractions Distance Vector distance between write and read [i,j] + c e.g., [0,1] Direction Vector direction of the instance that uses the value one of <, >, , , =, * e.g., [0,<] less precise, but sometimes sufficient EJCP 2017, June 29, Toulouse 17
Direction Vector Example 1 for (i=0; i<N; i++) for (j=0; j<M; j++) A[i][j] = A[i][0] + B[i][j]; j distance vector [0,1], [0,2], [0,3] direction vector [0,<] i EJCP 2017, June 29, Toulouse 18
Direction Vector Example 2 for (i=1; i<N; i++) for (j=0; j<M; j++) A[i][j] = A[i-1][j+1] + B[i][j]; j distance vector [-1,1] direction vector [>,<] i EJCP 2017, June 29, Toulouse 19
So what does these vectors do? Parallelism is clear [0,0,1] [0,1,0] [1,1,0] [0,0, 1] [0,1, 1] [0,1,-1] [1, 0,0] [1, 1,0] [1,-1,0] same for direction vectors Loop carried-dependence loop at depth dcarries a dependence if at least one of the distance/direction vectors have non-zero entry at d EJCP 2017, June 29, Toulouse 20
Legality of Loop Permutation Another application of distance vectors Which ones can you permute? for (i=1; i<N; i++) for (j=1; j<M; j++) A[i][j] = A[i-1][j-1] + B[i][j]; [1,1] for (i=1; i<N; i++) for (j=1; j<M; j++) A[i][j] = A[i][j-1] + B[i][j]; [0,1] for (i=1; i<N; i++) for (j=1; j<M; j++) A[i][j] = A[i-1][j+1] + B[i][j]; [1,-1] EJCP 2017, June 29, Toulouse 21
Legality of Loop Permutation Another application of distance vectors Which ones can you permute? for (i=1; i<N; i++) for (j=1; j<M; j++) A[i][j] = A[i-1][j-1] + B[i][j]; [1,1] j i EJCP 2017, June 29, Toulouse 22
Legality of Loop Permutation Another application of distance vectors Which ones can you permute? for (i=1; i<N; i++) for (j=1; j<M; j++) A[i][j] = A[i+1][j-1] + B[i][j]; A[i][j] = A[i][j-1] + B[i][j]; for (i=1; i<N; i++) for (j=1; j<M; j++) [0,1] j i EJCP 2017, June 29, Toulouse 23
Legality of Loop Permutation Another application of distance vectors Which ones can you permute? for (i=1; i<N; i++) for (j=1; j<M; j++) A[i][j] = A[i-1][j+1] + B[i][j]; [1,-1] j Fully permutable: [ ,..., ] i EJCP 2017, June 29, Toulouse 24
Outline Introduction Loop Parallelism and Dependences Dependence Tests Polyhedral Model Locality and Tiling EJCP 2017, June 29, Toulouse 25
How to Find the Vectors Easy case for (i=1; i<N; i++) for (j=0; j<M; j++) A[i][j] = A[i-1][j+1] + B[i][j]; Not too easy for (i=1; i<N; i++) for (j=0; j<M; j++) A[i] = A[i] + B[i][j]; for (i=1; i<N; i++) for (j=0; j<M; j++) A[i] = A[2*i-j+3] + B[i][j]; EJCP 2017, June 29, Toulouse 26
How to Find the Vectors Really difficult for (i=1; i<N; i++) for (j=0; j<M; j++) { A[i*i+j*j-i*j] = A[i] + B[i][j]; A[i*j*j-i*j*3] = A[i] + B[i][j]; } No general solution polynomial case is undecidable can work for linear accesses wide range of precise-ness even for linear case EJCP 2017, June 29, Toulouse 27
Dependence Tests: Basic Idea Given two accesses f(i,j) and g(x,y) the two accesses are in conflict if: same location: f(i,j) = g(x,y) one of them is a write Let f and g be affine a0+a1i+a2j = b0+b1x+b2y linear Diophantine equation solution exists if gcd(a1,a2,b1,b2)=|a0-b0| EJCP 2017, June 29, Toulouse 28
GCD Test for (i=1; i<N; i++) for (j=0; j<M; j++) A[3*i] = A[6*i-3*j+2] + B[i][j]; 3i=6x-3y+2 3i-6x+3y=2 gcd(3,6,3) = 2 ? for (i=1; i<N; i++) for (j=0; j<M; j++) A[2*i] = A[4*i-2*j+2] + B[i][j]; 2i=4x-2y+2 2i-4x+2y=2 gcd(2,4,2) = 2 ? EJCP 2017, June 29, Toulouse 29
Why is GCD Test Inexact? When does GCD test give false positive? What happens when GCD=1? for (i=0; i<N; i++) for (j=N; j<M; j++) A[i] = A[j] + B[i][j]; GCD test: i = j trivial solution exist Main problem the space is completely unconstrained EJCP 2017, June 29, Toulouse 30
Banerjee Test [Banerjee 1976] Making it slightly better There may be a dependence if min(f(i,j)-g(x,y)) 0, and 0 max(f(i,j)-g(x,y)) for (i=0; i<N; i++) for (j=N; j<M; j++) { A[i] = A[j] + B[i][j]; } min(i-j) = 0-(M-1) = 1-M max(i-j) = N-1-N = -1 EJCP 2017, June 29, Toulouse 31
Dependence Tests Method for detecting parallelism (probably) used by many production compilers important for vectorization not too good for thread-level parallelism How can we expose parallelism? EJCP 2017, June 29, Toulouse 32
Outline Introduction Loop Parallelism and Dependences Dependence Tests Polyhedral Model Locality and Tiling EJCP 2017, June 29, Toulouse 33
Exact Method: Polyhedral Model Array Dataflow Analysis[Feautrier 1991] Given read and write statement instances r,w Find w as a function of r such that r and w are in conflict w happens-before r w is the most recent write when everything is affine Main Engine Parametric Integer Linear Programming EJCP 2017, June 29, Toulouse 34
Exact Dependence Analysis Who produced the value read at A[j]? for (i=0; i<N; i++) for (j=0; j<M; j++) S: A[i] = A[j] + B[i][j]; S<i,j> = if i>j and j>0 : S<j,M-1>; if i=j and i>0 : S<j,j-1>; if j>i or i=j=0: A[j]; Powerful but expensive EJCP 2017, June 29, Toulouse 35
Alternative View of the ADA Performs full array expansion arrays are sources of complications recall: for (i=1; i<N; i++) for (j=0; j<M; j++) A[i][j] = A[i-1][j+1] + B[i][j]; Dynamic Single Assignment simplifies analysis ADA converts to such view memory vs parallelism EJCP 2017, June 29, Toulouse 36
Polyhedral Representation High-level abstraction of the program Statement instances: integer polyhedron Dependences: affine functions Usual optimization flow 1. extract polyhedral representation 2. reason/transform the model 3. generate code in the end EJCP 2017, June 29, Toulouse 37
Statement Instances Statements executed at different iterations of the loop Example: for (i=1; i<N; i++) for (j=1; j<=i; j++) S0; N2/2 instances of S0 j Denoted as S0<i,j> Represented as polyhedron j i {i,j|1 i<N, 1 j i} Geometric view 1 j i 1 i i<N EJCP 2017, June 29, Toulouse 38
Dependence Functions Expressed as affine functions of the space Example: 1D Stencil (t,i t-1,i-1) i (t,i t-1,i) (t,i t-1,i+1) Can be domain qualified if i=0: (t,i t-1,i-1) (t,i t-1,i) t EJCP 2017, June 29, Toulouse 39
Loop Transformations Also affine functions loop permutation: (i,j -> j,i) loop skewing: (i,j -> i,i+j) for (i=0; i<N; i++) for (j=0; j<M; j++) S; for (i=0; i<N; i++) for (j=i; j<M+i; j++) S ; Affine loops + affine transformation permits linear programming one of the few successes in parallelization EJCP 2017, June 29, Toulouse 40
Composing Transformations Key strength of the framework for i for j for j ... for i for j for i for i T1 T2 ... ... ... loop world abstraction poly poly EJCP 2017, June 29, Toulouse 41
Polyhedral Loop Transformations Everything is affine iteration space dependences schedule (transformations) memory mapping The set of possible transformations can be explored by ILP find schedule that exposes parallelism what is a good objective function? EJCP 2017, June 29, Toulouse 42
Outline Introduction Loop Parallelism and Dependences Dependence Tests Polyhedral Model Locality and Tiling EJCP 2017, June 29, Toulouse 43
Data Locality 1D Jacobi Stencil for t=0; t<T; t++ for i=1; i<N-1; i++ A[i] = f(B[i], B[i-1], B[i+1]) //swap A and B Poor data locality Data that fit in cache t=1 t=0 EJCP 2017, June 29, Toulouse 44
Tiling Well-known transformation in HPC [Wolfe 87] for t=0; t<T; t++ for i=1; i<N-1; i++ A[i] = f(B[i], B[i-1], B[i+1]) //swap A and B ? Improves data locality Performance improvement my laptop: 5.95s 4.76s (20%) PLuTo: ~1.8x (with older processor) EJCP 2017, June 29, Toulouse 45
So what is Tiling? Loop transformation Effect: change the order of execution for t=0; t<T; t++ for i=1; i<N-1; i++ foo() for tt=0; tt<T; t+=x for ti=1; ti<N-1; ti+=y for t=tt; t<min(tt+x,T); t++ for i=ti; i<min(ti+y,N-1); i++ foo() EJCP 2017, June 29, Toulouse 46
Visualization of Tiling Improve locality through temporal locality Each tile becomes an atomic unit i i t t EJCP 2017, June 29, Toulouse 47
What is Tiling? Loop transformation that doubles the depth for (t=0; t<N; t++) for (i=0; i<M; i++) S i Tile Origins Tile Loops for (x=0; x<N; x+=3) for (y=0; y<M; y+=3) S t EJCP 2017, June 29, Toulouse 48
What is Tiling? Loop transformation that doubles the depth for (t=0; t<N; t++) for (i=0; i<M; i++) S i Point Loops for (t=x; t<x+3; t++) for (i=y; i<y+3; i++) S t EJCP 2017, June 29, Toulouse 49
What is Tiling? Loop transformation that doubles the depth for (t=0; t<N; t++) for (i=0; i<M; i++) S i M for (t=x; t<x+3; t++) for (i=y; i<y+3; i++) S i<min(y+3,M) t EJCP 2017, June 29, Toulouse 50