Introduction to Loop Transformations in Compiler Optimizations

Slide Note

Delve into the world of compiler optimizations with a focus on loop transformations. Explore the significance of high-level optimizations, parallelism, and data locality. Understand the impact of loop transformations on program performance and learn why they are crucial for achieving faster execution times. Discover the challenges in automating optimizations and the intricacies of applying transformations effectively.

izza_992 Follow

Uploaded on Feb 27, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Research in Compilers and Introduction to Loop Transformations Part II: Compiler Optimizations Tomofumi Yuki EJCP 2017 June 29, Toulouse

Compiler Optimizations Low-level Optimizations register allocation instruction scheduling constant propagation ... High-level Optimizations loop transformations Focus for today coarse grained parallelism ... EJCP 2017, June 29, Toulouse 2

High-Level Optimizations Goals: Parallelism and Data Locality Why Parallelism? Why Data Locality? Why High-Level? EJCP 2017, June 29, Toulouse 3

Why Loop Transformations? The 90/10 Rule 90% of the execution time is spent in less than 10% of the source code Loop Nests hotspot of almost all programs few lines of change => huge impact natural source of parallelism EJCP 2017, June 29, Toulouse 4

Why Loop Transformations? Which is faster? for (i=0; i<N; i++) for (j=0; j<N; j++) for (k=0; k<N; k++) C[i][j] += A[i][k] * B[k][j]; for (i=0; i<N; i++) for (k=0; k<N; k++) for (j=0; j<N; j++) C[i][j] += A[i][k] * B[k][j]; EJCP 2017, June 29, Toulouse 5

Why is it Faster? Hardware Prefetching for (i=0; i<N; i++) for (j=0; j<N; j++) for (k=0; k<N; k++) C[i][j] += A[i][k] * B[k][j]; unchanged next col next row for (i=0; i<N; i++) for (k=0; k<N; k++) for (j=0; j<N; j++) C[i][j] += A[i][k] * B[k][j]; next col unchanged EJCP 2017, June 29, Toulouse next col 6

How to Automate? The most challenging part! The same optimization doesn t work for: for (i=0; i<N; i++) for (j=0; j<N; j++) for (k=0; k<N; k++) { C1[i][j] += A1[i][k] * B1[k][j]; C2[i][j] += A2[i][k] * B2[k][j]; C3[i][j] += A3[i][k] * B3[k][j]; C4[i][j] += A4[i][k] * B4[k][j]; } Why? EJCP 2017, June 29, Toulouse 7

Its Not Just Transformations Many many reasoning steps: What to apply? How to apply? Compiler Research is all about coming up with techniques/abstractions/representations to allow the compiler to perform deep analysis. When to apply? What is its impact? Quality of the analysis: How long does it take? Can it potentially degrade performance? Provable properties (completeness, etc.) EJCP 2017, June 29, Toulouse 8

What I Cover Today Some techniques to reason about programs for parallelism and locality data dependences Some transformations tiling tiling tiling Some thing special: Simplifying Reductions EJCP 2017, June 29, Toulouse 9

Outline Introduction Loop Parallelism and Dependences Dependence Tests Polyhedral Model Locality and Tiling EJCP 2017, June 29, Toulouse 10

Loop Parallelism Simple transformation forall (i=0; i<N; i++) S; for (i=0; i<N; i++) S; Not so simple to reason about Legality Performance impacts More complicated cases Transform the loops to expose parallelism EJCP 2017, June 29, Toulouse 11

Analyzing Legality Is there a parallel loop? for (i=0; i<N; i++) for (j=1; j<M; j++) { A[i][j] = A[i][j-1] + B[i][j]; } EJCP 2017, June 29, Toulouse 12

Extended Graphs Completely unroll the loops for (i=0; i<5; i++) for (j=1; j<4; j++) { A[i][j] = A[i][j-1] + B[i][j]; } A[0][1] = A[0][0] + B[0][1]; A[0][2] = A[0][1] + B[0][2]; A[0][3] = A[0][2] + B[0][3]; A[1][1] = A[1][0] + B[1][1]; A[1][2] = A[1][1] + B[1][2]; A[1][3] = A[1][2] + B[1][3]; .... EJCP 2017, June 29, Toulouse 13

Extended Graphs Completely unroll the loops for (i=0; i<N; i++) for (j=1; j<M; j++) { A[i][j] = A[i][j-1] + B[i][j]; } The difficulty: program parameters its easy with DAG representation scalability issues what if parameters are not known? EJCP 2017, June 29, Toulouse 14

Iteration Spaces Need an abstraction for statement instances for (i=0; i<N; i++) for (j=1; j<M; j++) { A[i][j] = A[i][j-1] + B[i][j]; } j instance = integer vector [i,j] space = integer set 0 i<N and 1 j<M i EJCP 2017, June 29, Toulouse 15

Dependences Express relations between statements flow (true) dependence RAW anti-dependence WAR output dependence WAW input dependence RAR EJCP 2017, June 29, Toulouse 16

Dependence Abstractions Distance Vector distance between write and read [i,j] + c e.g., [0,1] Direction Vector direction of the instance that uses the value one of <, >, , , =, * e.g., [0,<] less precise, but sometimes sufficient EJCP 2017, June 29, Toulouse 17

Direction Vector Example 1 for (i=0; i<N; i++) for (j=0; j<M; j++) A[i][j] = A[i][0] + B[i][j]; j distance vector [0,1], [0,2], [0,3] direction vector [0,<] i EJCP 2017, June 29, Toulouse 18

Direction Vector Example 2 for (i=1; i<N; i++) for (j=0; j<M; j++) A[i][j] = A[i-1][j+1] + B[i][j]; j distance vector [-1,1] direction vector [>,<] i EJCP 2017, June 29, Toulouse 19

So what does these vectors do? Parallelism is clear [0,0,1] [0,1,0] [1,1,0] [0,0, 1] [0,1, 1] [0,1,-1] [1, 0,0] [1, 1,0] [1,-1,0] same for direction vectors Loop carried-dependence loop at depth dcarries a dependence if at least one of the distance/direction vectors have non-zero entry at d EJCP 2017, June 29, Toulouse 20

Legality of Loop Permutation Another application of distance vectors Which ones can you permute? for (i=1; i<N; i++) for (j=1; j<M; j++) A[i][j] = A[i-1][j-1] + B[i][j]; [1,1] for (i=1; i<N; i++) for (j=1; j<M; j++) A[i][j] = A[i][j-1] + B[i][j]; [0,1] for (i=1; i<N; i++) for (j=1; j<M; j++) A[i][j] = A[i-1][j+1] + B[i][j]; [1,-1] EJCP 2017, June 29, Toulouse 21

Legality of Loop Permutation Another application of distance vectors Which ones can you permute? for (i=1; i<N; i++) for (j=1; j<M; j++) A[i][j] = A[i-1][j-1] + B[i][j]; [1,1] j i EJCP 2017, June 29, Toulouse 22

Legality of Loop Permutation Another application of distance vectors Which ones can you permute? for (i=1; i<N; i++) for (j=1; j<M; j++) A[i][j] = A[i+1][j-1] + B[i][j]; A[i][j] = A[i][j-1] + B[i][j]; for (i=1; i<N; i++) for (j=1; j<M; j++) [0,1] j i EJCP 2017, June 29, Toulouse 23

Legality of Loop Permutation Another application of distance vectors Which ones can you permute? for (i=1; i<N; i++) for (j=1; j<M; j++) A[i][j] = A[i-1][j+1] + B[i][j]; [1,-1] j Fully permutable: [ ,..., ] i EJCP 2017, June 29, Toulouse 24

Outline Introduction Loop Parallelism and Dependences Dependence Tests Polyhedral Model Locality and Tiling EJCP 2017, June 29, Toulouse 25

How to Find the Vectors Easy case for (i=1; i<N; i++) for (j=0; j<M; j++) A[i][j] = A[i-1][j+1] + B[i][j]; Not too easy for (i=1; i<N; i++) for (j=0; j<M; j++) A[i] = A[i] + B[i][j]; for (i=1; i<N; i++) for (j=0; j<M; j++) A[i] = A[2*i-j+3] + B[i][j]; EJCP 2017, June 29, Toulouse 26

How to Find the Vectors Really difficult for (i=1; i<N; i++) for (j=0; j<M; j++) { A[i*i+j*j-i*j] = A[i] + B[i][j]; A[i*j*j-i*j*3] = A[i] + B[i][j]; } No general solution polynomial case is undecidable can work for linear accesses wide range of precise-ness even for linear case EJCP 2017, June 29, Toulouse 27

Dependence Tests: Basic Idea Given two accesses f(i,j) and g(x,y) the two accesses are in conflict if: same location: f(i,j) = g(x,y) one of them is a write Let f and g be affine a0+a1i+a2j = b0+b1x+b2y linear Diophantine equation solution exists if gcd(a1,a2,b1,b2)=|a0-b0| EJCP 2017, June 29, Toulouse 28

GCD Test for (i=1; i<N; i++) for (j=0; j<M; j++) A[3*i] = A[6*i-3*j+2] + B[i][j]; 3i=6x-3y+2 3i-6x+3y=2 gcd(3,6,3) = 2 ? for (i=1; i<N; i++) for (j=0; j<M; j++) A[2*i] = A[4*i-2*j+2] + B[i][j]; 2i=4x-2y+2 2i-4x+2y=2 gcd(2,4,2) = 2 ? EJCP 2017, June 29, Toulouse 29

Why is GCD Test Inexact? When does GCD test give false positive? What happens when GCD=1? for (i=0; i<N; i++) for (j=N; j<M; j++) A[i] = A[j] + B[i][j]; GCD test: i = j trivial solution exist Main problem the space is completely unconstrained EJCP 2017, June 29, Toulouse 30

Banerjee Test [Banerjee 1976] Making it slightly better There may be a dependence if min(f(i,j)-g(x,y)) 0, and 0 max(f(i,j)-g(x,y)) for (i=0; i<N; i++) for (j=N; j<M; j++) { A[i] = A[j] + B[i][j]; } min(i-j) = 0-(M-1) = 1-M max(i-j) = N-1-N = -1 EJCP 2017, June 29, Toulouse 31

Dependence Tests Method for detecting parallelism (probably) used by many production compilers important for vectorization not too good for thread-level parallelism How can we expose parallelism? EJCP 2017, June 29, Toulouse 32

Outline Introduction Loop Parallelism and Dependences Dependence Tests Polyhedral Model Locality and Tiling EJCP 2017, June 29, Toulouse 33

Exact Method: Polyhedral Model Array Dataflow Analysis[Feautrier 1991] Given read and write statement instances r,w Find w as a function of r such that r and w are in conflict w happens-before r w is the most recent write when everything is affine Main Engine Parametric Integer Linear Programming EJCP 2017, June 29, Toulouse 34

Exact Dependence Analysis Who produced the value read at A[j]? for (i=0; i<N; i++) for (j=0; j<M; j++) S: A[i] = A[j] + B[i][j]; S<i,j> = if i>j and j>0 : S<j,M-1>; if i=j and i>0 : S<j,j-1>; if j>i or i=j=0: A[j]; Powerful but expensive EJCP 2017, June 29, Toulouse 35

Alternative View of the ADA Performs full array expansion arrays are sources of complications recall: for (i=1; i<N; i++) for (j=0; j<M; j++) A[i][j] = A[i-1][j+1] + B[i][j]; Dynamic Single Assignment simplifies analysis ADA converts to such view memory vs parallelism EJCP 2017, June 29, Toulouse 36

Polyhedral Representation High-level abstraction of the program Statement instances: integer polyhedron Dependences: affine functions Usual optimization flow 1. extract polyhedral representation 2. reason/transform the model 3. generate code in the end EJCP 2017, June 29, Toulouse 37

Statement Instances Statements executed at different iterations of the loop Example: for (i=1; i<N; i++) for (j=1; j<=i; j++) S0; N2/2 instances of S0 j Denoted as S0<i,j> Represented as polyhedron j i {i,j|1 i<N, 1 j i} Geometric view 1 j i 1 i i<N EJCP 2017, June 29, Toulouse 38

Dependence Functions Expressed as affine functions of the space Example: 1D Stencil (t,i t-1,i-1) i (t,i t-1,i) (t,i t-1,i+1) Can be domain qualified if i=0: (t,i t-1,i-1) (t,i t-1,i) t EJCP 2017, June 29, Toulouse 39

Loop Transformations Also affine functions loop permutation: (i,j -> j,i) loop skewing: (i,j -> i,i+j) for (i=0; i<N; i++) for (j=0; j<M; j++) S; for (i=0; i<N; i++) for (j=i; j<M+i; j++) S ; Affine loops + affine transformation permits linear programming one of the few successes in parallelization EJCP 2017, June 29, Toulouse 40

Composing Transformations Key strength of the framework for i for j for j ... for i for j for i for i T1 T2 ... ... ... loop world abstraction poly poly EJCP 2017, June 29, Toulouse 41

Polyhedral Loop Transformations Everything is affine iteration space dependences schedule (transformations) memory mapping The set of possible transformations can be explored by ILP find schedule that exposes parallelism what is a good objective function? EJCP 2017, June 29, Toulouse 42

Outline Introduction Loop Parallelism and Dependences Dependence Tests Polyhedral Model Locality and Tiling EJCP 2017, June 29, Toulouse 43

Data Locality 1D Jacobi Stencil for t=0; t<T; t++ for i=1; i<N-1; i++ A[i] = f(B[i], B[i-1], B[i+1]) //swap A and B Poor data locality Data that fit in cache t=1 t=0 EJCP 2017, June 29, Toulouse 44

Tiling Well-known transformation in HPC [Wolfe 87] for t=0; t<T; t++ for i=1; i<N-1; i++ A[i] = f(B[i], B[i-1], B[i+1]) //swap A and B ? Improves data locality Performance improvement my laptop: 5.95s 4.76s (20%) PLuTo: ~1.8x (with older processor) EJCP 2017, June 29, Toulouse 45

So what is Tiling? Loop transformation Effect: change the order of execution for t=0; t<T; t++ for i=1; i<N-1; i++ foo() for tt=0; tt<T; t+=x for ti=1; ti<N-1; ti+=y for t=tt; t<min(tt+x,T); t++ for i=ti; i<min(ti+y,N-1); i++ foo() EJCP 2017, June 29, Toulouse 46

Visualization of Tiling Improve locality through temporal locality Each tile becomes an atomic unit i i t t EJCP 2017, June 29, Toulouse 47

What is Tiling? Loop transformation that doubles the depth for (t=0; t<N; t++) for (i=0; i<M; i++) S i Tile Origins Tile Loops for (x=0; x<N; x+=3) for (y=0; y<M; y+=3) S t EJCP 2017, June 29, Toulouse 48

What is Tiling? Loop transformation that doubles the depth for (t=0; t<N; t++) for (i=0; i<M; i++) S i Point Loops for (t=x; t<x+3; t++) for (i=y; i<y+3; i++) S t EJCP 2017, June 29, Toulouse 49

What is Tiling? Loop transformation that doubles the depth for (t=0; t<N; t++) for (i=0; i<M; i++) S i M for (t=x; t<x+3; t++) for (i=y; i<y+3; i++) S i<min(y+3,M) t EJCP 2017, June 29, Toulouse 50

Introduction to Loop Transformations in Compiler Optimizations

Download Presentation

Presentation Transcript

Related

More Related Content