Compiler Optimizations: Introduction to Loop Transformations

Dive into the world of compiler optimizations, with a focus on loop transformations. Explore the significance of high-level optimizations, parallelism, and data locality in enhancing program efficiency. Understand the impact of loop transformations on performance, and learn how the optimization process can be automated.

  • Compiler Optimizations
  • Loop Transformations
  • High-Level Optimizations
  • Parallelism
  • Data Locality


Presentation Transcript


  1. Research in Compilers and Introduction to Loop Transformations, Part II: Compiler Optimizations. Tomofumi Yuki, EJCP 2016, June 29, Lille.

  2. Compiler Optimizations

     Low-level optimizations: register allocation, instruction scheduling, constant propagation, ...

     High-level optimizations (the focus for today): loop transformations, coarse-grained parallelism, ...

  3. High-Level Optimizations

     Goals: parallelism and data locality. Why parallelism? Why data locality? Why high-level?

  4. Why Loop Transformations?

     The 90/10 rule: 90% of the execution time is spent in less than 10% of the source code. Loop nests are the hotspot of almost all programs; a few lines of change can have a huge impact, and loops are a natural source of parallelism.

  5. Why Loop Transformations? Which is faster?

         for (i=0; i<N; i++)
           for (j=0; j<N; j++)
             for (k=0; k<N; k++)
               C[i][j] += A[i][k] * B[k][j];

         for (i=0; i<N; i++)
           for (k=0; k<N; k++)
             for (j=0; j<N; j++)
               C[i][j] += A[i][k] * B[k][j];

  6. Why is it Faster? Hardware prefetching.

     In the i-j-k order, the innermost loop leaves C[i][j] unchanged, walks A[i][k] to the next column (unit stride), and walks B[k][j] to the next row (stride N):

         for (i=0; i<N; i++)
           for (j=0; j<N; j++)
             for (k=0; k<N; k++)
               C[i][j] += A[i][k] * B[k][j];

     In the i-k-j order, A[i][k] is unchanged in the innermost loop while C[i][j] and B[k][j] both step to the next column, so every moving access is unit-stride and prefetch-friendly:

         for (i=0; i<N; i++)
           for (k=0; k<N; k++)
             for (j=0; j<N; j++)
               C[i][j] += A[i][k] * B[k][j];
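     To make the comparison concrete, here is a minimal benchmark sketch; the value of N, the initialization values, and the clock()-based harness are assumptions, not from the slides.

         #include <stdio.h>
         #include <time.h>

         #define N 512
         static double A[N][N], B[N][N], C[N][N];

         /* i-j-k: B is accessed column-wise (stride N), defeating
          * spatial locality and hardware prefetching. */
         static void matmul_ijk(void) {
             for (int i = 0; i < N; i++)
                 for (int j = 0; j < N; j++)
                     for (int k = 0; k < N; k++)
                         C[i][j] += A[i][k] * B[k][j];
         }

         /* i-k-j: the innermost loop walks C and B row-wise (stride 1). */
         static void matmul_ikj(void) {
             for (int i = 0; i < N; i++)
                 for (int k = 0; k < N; k++)
                     for (int j = 0; j < N; j++)
                         C[i][j] += A[i][k] * B[k][j];
         }

         int main(void) {
             for (int i = 0; i < N; i++)
                 for (int j = 0; j < N; j++) {
                     A[i][j] = 1.0; B[i][j] = 2.0; C[i][j] = 0.0;
                 }
             clock_t t0 = clock();
             matmul_ijk();
             clock_t t1 = clock();
             matmul_ikj();
             clock_t t2 = clock();
             printf("ijk: %.2fs  ikj: %.2fs\n",
                    (double)(t1 - t0) / CLOCKS_PER_SEC,
                    (double)(t2 - t1) / CLOCKS_PER_SEC);
             return 0;
         }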

  7. How to Automate? The most challenging part! The same optimization doesn't work for:

         for (i=0; i<N; i++)
           for (j=0; j<N; j++)
             for (k=0; k<N; k++) {
               C1[i][j] += A1[i][k] * B1[k][j];
               C2[i][j] += A2[i][k] * B2[k][j];
               C3[i][j] += A3[i][k] * B3[k][j];
               C4[i][j] += A4[i][k] * B4[k][j];
             }

     Why?

  8. It's Not Just Transformations

     There are many, many reasoning steps: what to apply? how to apply? when to apply? what is its impact? And the quality of the analysis matters: how long does it take? can it potentially degrade performance? does it have provable properties (completeness, etc.)? Compiler research is all about coming up with techniques, abstractions, and representations that allow the compiler to perform deep analysis.

  9. What I Cover Today

     Some techniques to reason about programs for parallelism and locality: data dependences. Some transformations: tiling, tiling, tiling. Something special: simplifying reductions.

  10. Outline: Introduction; Loop Parallelism and Dependences; Dependence Tests; Polyhedral Model; Locality and Tiling.

  11. Loop Parallelism

     A simple transformation:

         for (i=0; i<N; i++) S;      =>      forall (i=0; i<N; i++) S;

     But it is not so simple to reason about: legality, performance impacts, and more complicated cases where the loops must first be transformed to expose parallelism.
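     In practice, one common way to realize forall is OpenMP's parallel-for. A minimal sketch, assuming a body with no loop-carried dependence; the function and its body are illustrative.

         /* Compile with -fopenmp; the pragma is ignored otherwise. */
         void scale(double *A, double c, int N) {
             #pragma omp parallel for   /* forall (i=0; i<N; i++) */
             for (int i = 0; i < N; i++)
                 A[i] = c * A[i];       /* S: independent across i */
         }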

  12. Analyzing Legality: Is there a parallel loop?

         for (i=0; i<N; i++)
           for (j=1; j<M; j++) {
             A[i][j] = A[i][j-1] + B[i][j];
           }
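     For reference, the dependence here (A[i][j-1] feeding A[i][j]) runs only along j, so the i loop carries no dependence; the next slides develop the reasoning. A sketch of the resulting legal parallelization, assuming an OpenMP annotation:

         /* Each row i is independent: iteration (i,j) reads only
          * A[i][j-1], written earlier within the same i. */
         void compute(int N, int M, double A[N][M], double B[N][M]) {
             #pragma omp parallel for
             for (int i = 0; i < N; i++)
                 for (int j = 1; j < M; j++)  /* j carries the dependence */
                     A[i][j] = A[i][j-1] + B[i][j];
         }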

  13. Extended Graphs: Completely unroll the loops

         for (i=0; i<5; i++)
           for (j=1; j<4; j++) {
             A[i][j] = A[i][j-1] + B[i][j];
           }

         A[0][1] = A[0][0] + B[0][1];
         A[0][2] = A[0][1] + B[0][2];
         A[0][3] = A[0][2] + B[0][3];
         A[1][1] = A[1][0] + B[1][1];
         A[1][2] = A[1][1] + B[1][2];
         A[1][3] = A[1][2] + B[1][3];
         ...

  14. Extended Graphs: Completely unroll the loops

         for (i=0; i<N; i++)
           for (j=1; j<M; j++) {
             A[i][j] = A[i][j-1] + B[i][j];
           }

     The difficulty: program parameters. The analysis is easy with a DAG representation, but it has scalability issues, and what if the parameters are not known?

  15. Iteration Spaces: We need an abstraction for statement instances

         for (i=0; i<N; i++)
           for (j=1; j<M; j++) {
             A[i][j] = A[i][j-1] + B[i][j];
           }

     An instance is an integer vector [i,j]; the space is an integer set: 0 <= i < N and 1 <= j < M. (The slide shows the corresponding rectangle in the i-j plane.)

  16. Dependences: Express relations between statements

       • flow (true) dependence: RAW (read after write)
       • anti-dependence: WAR (write after read)
       • output dependence: WAW (write after write)
       • input dependence: RAR (read after read)

  17. Dependence Abstractions

     Distance vector: the distance between the write and the read, [i,j] + c, e.g., [0,1]. Direction vector: the direction of the instance that uses the value; each entry is one of <, >, <=, >=, =, *; e.g., [0,<]. Less precise, but sometimes sufficient.

  18. Direction Vector Example 1

         for (i=0; i<N; i++)
           for (j=0; j<M; j++)
             A[i][j] = A[i][0] + B[i][j];

     Distance vectors: [0,1], [0,2], [0,3], ...; direction vector: [0,<].

  19. Direction Vector Example 2

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[i][j] = A[i-1][j+1] + B[i][j];

     Distance vector: [1,-1]; direction vector: [<,>] (the value written at [i-1,j+1] is read at [i,j]).

  20. So What Do These Vectors Tell Us?

     Parallelism becomes clear. Given a set of distance vectors such as {[0,0,1], [0,1,0], [1,1,0]}, {[0,0,1], [0,1,1], [0,1,-1]}, or {[1,0,0], [1,1,0], [1,-1,0]}, we can read off which loops are parallel; the same works for direction vectors. Loop-carried dependence: the loop at depth d carries a dependence if at least one of the distance/direction vectors has its leading non-zero entry at position d. For example, in {[1,0,0], [1,1,0], [1,-1,0]} the outermost loop carries every dependence, so the inner two loops are parallel. (A small executable version of this rule is sketched below.)
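     A minimal C sketch of that rule; the fixed depth, array layout, and function names are assumptions made for illustration.

         #include <stdbool.h>

         #define DEPTH 3  /* matches the 3-deep vectors above */

         /* The loop at depth d (0-based) carries `dist` iff all entries
          * before d are zero and the entry at d is non-zero. */
         bool carries(const int dist[DEPTH], int d) {
             for (int k = 0; k < d; k++)
                 if (dist[k] != 0) return false;  /* carried further out */
             return dist[d] != 0;
         }

         /* Loop d is parallel iff no distance vector is carried at d. */
         bool loop_is_parallel(const int dists[][DEPTH], int m, int d) {
             for (int v = 0; v < m; v++)
                 if (carries(dists[v], d)) return false;
             return true;
         }

     For {[1,0,0], [1,1,0], [1,-1,0]}, loop_is_parallel reports depth 0 as sequential and depths 1 and 2 as parallel, matching the reading above.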

  21. Legality of Loop Permutation: another application of distance vectors. Which ones can you permute?

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i-1][j-1] + B[i][j];   // [1,1]

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i][j-1] + B[i][j];     // [0,1]

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i-1][j+1] + B[i][j];   // [1,-1]

  22. Legality of Loop Permutation (continued)

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i-1][j-1] + B[i][j];   // [1,1]

     Interchanging i and j turns [1,1] into [1,1], which is still lexicographically positive: the interchange is legal.

  23. Legality of Loop Permutation (continued)

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i+1][j-1] + B[i][j];

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i][j-1] + B[i][j];     // [0,1]

     Interchanging turns [0,1] into [1,0], still lexicographically positive: legal.

  24. Legality of Loop Permutation (continued)

         for (i=1; i<N; i++)
           for (j=1; j<M; j++)
             A[i][j] = A[i-1][j+1] + B[i][j];   // [1,-1]

     Interchanging turns [1,-1] into [-1,1], which is lexicographically negative: the interchange is illegal. Fully permutable nest: every distance vector is non-negative in every entry, [>=0, ..., >=0], so any permutation is legal.
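     The check applied by eye on these three slides can be written down directly; a sketch with illustrative function names.

         #include <stdbool.h>

         /* Lexicographic positivity: the first non-zero entry is positive. */
         bool lex_positive(const int *v, int n) {
             for (int k = 0; k < n; k++) {
                 if (v[k] > 0) return true;
                 if (v[k] < 0) return false;
             }
             return false;  /* all-zero is not a real dependence */
         }

         /* 2D interchange swaps each [a,b] into [b,a]; it is legal iff
          * every swapped vector is still lexicographically positive. */
         bool interchange_legal(const int dists[][2], int m) {
             for (int v = 0; v < m; v++) {
                 int swapped[2] = { dists[v][1], dists[v][0] };
                 if (!lex_positive(swapped, 2)) return false;
             }
             return true;
         }

     On the three examples: [1,1] -> [1,1] (legal), [0,1] -> [1,0] (legal), [1,-1] -> [-1,1] (illegal).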

  25. Outline: Introduction; Loop Parallelism and Dependences; Dependence Tests; Polyhedral Model; Locality and Tiling.

  26. How to Find the Vectors

     Easy case:

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[i][j] = A[i-1][j+1] + B[i][j];

     Not too easy:

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[i] = A[i] + B[i][j];

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[i] = A[2*i-j+3] + B[i][j];

  27. How to Find the Vectors

     Really difficult:

         for (i=1; i<N; i++)
           for (j=0; j<M; j++) {
             A[i*i+j*j-i*j] = A[i] + B[i][j];
             A[i*j*j-i*j*3] = A[i] + B[i][j];
           }

     There is no general solution: the polynomial case is undecidable. Tests can work for linear accesses, but with a wide range of precision even in the linear case.

  28. Dependence Tests: Basic Idea

     Given two accesses f(i,j) and g(x,y), the accesses are in conflict if they touch the same location, f(i,j) = g(x,y), and at least one of them is a write. Let f and g be affine: a0 + a1*i + a2*j = b0 + b1*x + b2*y is a linear Diophantine equation, and an integer solution exists iff gcd(a1, a2, b1, b2) divides b0 - a0.

  29. GCD Test

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[3*i] = A[6*i-3*j+2] + B[i][j];

     3i = 6x - 3y + 2, i.e., 3i - 6x + 3y = 2. Does gcd(3,6,3) = 3 divide 2? No: the accesses are independent.

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[2*i] = A[4*i-2*j+2] + B[i][j];

     2i = 4x - 2y + 2, i.e., 2i - 4x + 2y = 2. Does gcd(2,4,2) = 2 divide 2? Yes: there may be a dependence.
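     The test is small enough to write out; a sketch with assumed helper names that reproduces both examples.

         #include <stdbool.h>
         #include <stdio.h>

         int gcd(int a, int b) {
             if (a < 0) a = -a;
             if (b < 0) b = -b;
             while (b != 0) { int t = a % b; a = b; b = t; }
             return a;
         }

         /* Equation c1*i + c2*x + c3*y = rhs has an integer solution
          * iff gcd(c1,c2,c3) divides rhs. */
         bool gcd_test(int c1, int c2, int c3, int rhs) {
             return rhs % gcd(gcd(c1, c2), c3) == 0;
         }

         int main(void) {
             printf("%d\n", gcd_test(3, -6, 3, 2));  /* 0: independent */
             printf("%d\n", gcd_test(2, -4, 2, 2));  /* 1: may depend  */
             return 0;
         }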

  30. Why is the GCD Test Inexact?

     When does the GCD test give a false positive? What happens when the gcd is 1?

         for (i=0; i<N; i++)
           for (j=N; j<M; j++)
             A[i] = A[j] + B[i][j];

     The GCD test looks at i = j, which has a trivial solution, so it reports a possible dependence; yet i < N <= j at every iteration, so the accesses never conflict. The main problem: the space is completely unconstrained; the test ignores the loop bounds.

  31. Banerjee Test [Banerjee 1976]

     Making it slightly better: there may be a dependence only if min(f(i,j) - g(x,y)) <= 0 and 0 <= max(f(i,j) - g(x,y)).

         for (i=0; i<N; i++)
           for (j=N; j<M; j++) {
             A[i] = A[j] + B[i][j];
           }

     Here min(i-j) = 0 - (M-1) = 1-M and max(i-j) = (N-1) - N = -1 < 0, so there is no dependence.
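     The same bounds, hard-coded for this example; a sketch only, since the real test derives the extrema symbolically from the loop bounds (and this one assumes 0 < N <= M so both loops are non-empty).

         #include <stdbool.h>

         bool banerjee_may_depend(int N, int M) {
             /* i ranges over [0, N-1] and j over [N, M-1], so: */
             int min_diff = 0 - (M - 1);     /* = 1 - M               */
             int max_diff = (N - 1) - N;     /* = -1: i is always < j */
             return min_diff <= 0 && 0 <= max_diff;  /* false here */
         }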

  32. Outline: Introduction; Loop Parallelism and Dependences; Dependence Tests; Polyhedral Model; Locality and Tiling.

  33. Exact Method: Polyhedral Model

     Array Dataflow Analysis [Feautrier 1991]: given read and write statement instances r and w, find w as a function of r such that r and w are in conflict, w happens before r, and w is the most recent such write. This is computable when everything is affine. Main engine: parametric integer linear programming.

  34. Exact Dependence Analysis

     Who produced the value read at A[j]?

         for (i=0; i<N; i++)
           for (j=0; j<M; j++)
     S:      A[i] = A[j] + B[i][j];

     The source of the value read by S<i,j> is:
       • if i > j and j > 0: S<j, M-1>
       • if i = j and i > 0: S<j, j-1>
       • if j > i or i = j = 0: the initial value of A[j]

     Powerful, but expensive.

  35. Alternative View of the ADA

     ADA performs full array expansion; arrays are a source of complications. Recall:

         for (i=1; i<N; i++)
           for (j=0; j<M; j++)
             A[i][j] = A[i-1][j+1] + B[i][j];

     Dynamic Single Assignment form simplifies analysis, and ADA converts the program to such a view. The trade-off: memory vs. parallelism.

  36. Polyhedral Representation

     A high-level abstraction of the program: statement instances are integer polyhedra, and dependences are affine functions. The usual optimization flow:
       1. extract the polyhedral representation
       2. reason about and transform the model
       3. generate code at the end

  37. Statement Instances

     Statements execute at different iterations of the loops. Example:

         for (i=1; i<N; i++)
           for (j=1; j<=i; j++)
             S0;

     There are roughly N^2/2 instances of S0, denoted S0<i,j> and represented as the polyhedron {[i,j] | 1 <= i < N, 1 <= j <= i}, which has a geometric (triangular) view in the i-j plane.
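     A small sketch that enumerates exactly these instances; the printing and counting are illustrative.

         #include <stdio.h>

         void enumerate(int N) {
             int count = 0;
             for (int i = 1; i < N; i++)
                 for (int j = 1; j <= i; j++) {  /* triangular domain */
                     printf("S0<%d,%d>\n", i, j);
                     count++;
                 }
             printf("%d instances\n", count);    /* N(N-1)/2, ~N^2/2 */
         }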

  38. Dependence Functions

     Dependences are expressed as affine functions on the space. Example, a 1D stencil over (t,i):
       (t,i -> t-1,i-1), (t,i -> t-1,i), (t,i -> t-1,i+1)
     They can be domain-qualified, e.g., if i=0: (t,i -> t-1,i-1) and (t,i -> t-1,i).

  39. Loop Transformations

     Transformations are also affine functions: loop permutation is (i,j -> j,i); loop skewing is (i,j -> i,i+j).

         for (i=0; i<N; i++)                 for (i=0; i<N; i++)
           for (j=0; j<M; j++)        =>       for (j=i; j<M+i; j++)
             S;                                  S';

     Affine loops plus affine transformations permit linear programming: one of the few success stories in automatic parallelization.
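     A hand-applied sketch of the skewing (i,j -> i,i+j); the statement body is an illustrative assumption, and S' recovers the original j as j - i.

         void original(int N, int M, double A[N][M]) {
             for (int i = 0; i < N; i++)
                 for (int j = 0; j < M; j++)
                     A[i][j] = i + j;             /* S  */
         }

         void skewed(int N, int M, double A[N][M]) {
             for (int i = 0; i < N; i++)
                 for (int j = i; j < M + i; j++)  /* skewed bounds  */
                     A[i][j - i] = i + (j - i);   /* S': same instances */
         }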

  40. Composing Transformations

     A key strength of the framework: instead of rewriting the loop nest after each transformation, the program is lifted once into the polyhedral abstraction, the transformations T1, T2, ... are composed there, and loops are regenerated once at the end. (The slide shows this as a diagram: the loop world on top, the polyhedral world below.)

  41. Outline: Introduction; Loop Parallelism and Dependences; Dependence Tests; Polyhedral Model; Locality and Tiling.

  42. Data Locality

     1D Jacobi stencil:

         for (t=0; t<T; t++)
           for (i=1; i<N-1; i++)
             A[i] = f(B[i], B[i-1], B[i+1]);
         // swap A and B after each time step

     Poor data locality: each time step sweeps the whole array, so by the time a value is reused in the next step it has been evicted. (The slide illustrates the t=0 and t=1 sweeps against the window of data that fit in cache.)
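     A compilable version of the slide's pseudo-code; the 3-point average used for f is an assumption, since the slides leave f abstract.

         /* Returns the buffer holding the final values (which one
          * depends on the parity of T). Boundary cells stay untouched,
          * as on the slide. */
         double *jacobi1d(int T, int N, double *A, double *B) {
             for (int t = 0; t < T; t++) {
                 for (int i = 1; i < N - 1; i++)
                     A[i] = (B[i] + B[i-1] + B[i+1]) / 3.0;  /* f(...) */
                 double *tmp = A; A = B; B = tmp;   /* swap A and B */
             }
             return B;  /* most recently written buffer */
         }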

  43. Tiling

     A well-known transformation in HPC [Wolfe 87]:

         for (t=0; t<T; t++)
           for (i=1; i<N-1; i++)
             A[i] = f(B[i], B[i-1], B[i+1]);
         // swap A and B

     Tiling improves data locality, and the performance improvement is real: on my laptop, 5.95s -> 4.76s (20%); PLuTo reports ~1.8x (with an older processor).

  44. So What is Tiling?

     A loop transformation whose effect is to change the order of execution:

         for (t=0; t<T; t++)
           for (i=1; i<N-1; i++)
             foo();

     becomes

         for (tt=0; tt<T; tt+=x)
           for (ti=1; ti<N-1; ti+=y)
             for (t=tt; t<min(tt+x,T); t++)
               for (i=ti; i<min(ti+y,N-1); i++)
                 foo();

  45. Visualization of Tiling

     Tiling improves locality by exploiting temporal locality: each tile becomes an atomic unit of execution. (The slide shows the t-i iteration space before and after tiling.)

  46. What is Tiling?

     A loop transformation that doubles the loop depth:

         for (t=0; t<N; t++)
           for (i=0; i<M; i++)
             S;

     Tile loops enumerate the tile origins:

         for (x=0; x<N; x+=3)
           for (y=0; y<M; y+=3)
             ...

  47. What is Tiling? (continued)

     Point loops enumerate the points within one tile:

         for (t=x; t<x+3; t++)
           for (i=y; i<y+3; i++)
             S;

  48. What is Tiling? (continued)

     When M is not a multiple of the tile size, the point-loop bound must be clamped at the border:

         for (t=x; t<x+3; t++)
           for (i=y; i<min(y+3,M); i++)
             S;

  49. What is Tiling? (continued)

     Putting tile loops and point loops together:

         for (x=0; x<N; x+=3)
           for (y=0; y<M; y+=3)
             for (t=x; t<min(x+3,N); t++)
               for (i=y; i<min(y+3,M); i++)
                 S;

  50. What is Tiling? (final form)

         for (t=0; t<N; t++)
           for (i=0; i<M; i++)
             S;

     becomes

         for (x=0; x<N; x+=3)
           for (y=0; y<M; y+=3)
             for (t=x; t<min(x+3,N); t++)
               for (i=y; i<min(y+3,M); i++)
                 S;
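     The final form, instantiated as compilable C with an illustrative body standing in for S (the slides leave S abstract):

         static inline int min(int a, int b) { return a < b ? a : b; }

         void tiled(int N, int M, double C[N][M]) {
             for (int x = 0; x < N; x += 3)                    /* tile loops  */
                 for (int y = 0; y < M; y += 3)
                     for (int t = x; t < min(x + 3, N); t++)   /* point loops */
                         for (int i = y; i < min(y + 3, M); i++)
                             C[t][i] += 1.0;                   /* S */
         }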
