Instruction-Level Parallelism in Processors


Delve into Instruction-Level Parallelism (ILP) in processor architectures, exploring pipelining, registers, cache, and vectorization. Learn how ILP manifests and how it enhances processor performance by executing independent instructions in parallel, the benefits of pipelining, and the importance of overcoming dependences for efficient ILP utilization.

  • Parallelism
  • Processors
  • Instruction-Level
  • Pipelining
  • Performance




Presentation Transcript


  1. Lecture 20: Instruction-Level Parallelism

  2. Today's lecture Pipelining and other forms of Instruction-Level Parallelism (ILP); building an efficient barrier. Scott B. Baden / CSE 160 / Wi '16

  3. What makes a processor run faster? Registers and cache; vectorization (SSE); instruction-level parallelism.

  4. What is Instruction-Level Parallelism? The potential to execute certain instructions in parallel because they are independent: w = u / z; a = b + c; Also, any technique for identifying and exploiting such opportunities. Static: can be implemented by a compiler. Dynamic: requires hardware support.

  5. How does ILP manifest itself? Basic blocks: sequences of instructions that appear between branches, usually no more than 5 or 6 instructions! Loops: for (i = 0; i < N; i++) x[i] = x[i] + s; We can only realize ILP by finding sequences of independent instructions. Dependent instructions must be separated by a sufficient amount of time, that is, the functional unit latency.

  6. Pipelining An implementation technique for overlapping instruction execution. Like an automobile assembly line, we break the instruction's execution into steps. We can then overlap the steps so long as there are no dependences (pipeline hazards) between the steps. (image: www.ford.ie)

  7. Pipelining Laundry analogy: wash (30 min) + dry (40 min) + fold (20 min) = 90 min. Sequential execution takes 4 * 90 min = 6 hours; pipelined execution takes 30 + 4*40 + 20 = 210 min = 3.5 hours. Bandwidth = loads/hour. Pipelining helps bandwidth but not latency (still 90 min per load). Bandwidth is limited by the slowest pipeline stage; potential speedup = number of pipe stages. [Figure: timeline of four loads A-D overlapped, starting at 6 PM.]

  8. Multi-execution pipeline (MIPS R4000) Examples: w = u / v; a = b + c; s = q * r; i++; and x = y / z; t = c * q; a = x + c; [Figure, after David Culler: the integer unit runs IF, ID, EX, MEM, WB; the FP/int multiplier has stages m1-m7; the FP adder has stages a1-a4; the FP/int divider is unpipelined (latency = 25).]

  9. Sandy Bridge [Figure: Sandy Bridge microarchitecture block diagram.]

  10. Pipeline Hazards What if we have dependencies? A data dependence is one kind of pipeline hazard, a data hazard, that prevents the next instruction from executing when expected. Structural hazards: physical resources aren't available. Control hazards: branches or other instructions that change the PC. The simplest solution is to stall the pipeline, introducing a bubble, but we can do better: x = y / z; a = x + c; t = c * q;

  11. Limitations of ILP Consider this loop: for (i = 0; i < N; i++) x[i] = x[i] + s;
      L:  LD    F0, 0(R1)    ; F0 is the vector element
          ADDD  F4, F0, F2   ; add the scalar
          SD    0(R1), F4    ; store the result
          SUBI  R1, R1, #8   ; decrement by 8 to previous word
          BNEZ  R1, L        ; branch if R1 != 0
          NOP                ; delayed branch slot

  12. Delays for (i = 0; i < N; i++) x[i] = x[i] + s;
      Unit          Latency   Init Interval
      Integer          0           1
      Memory           1           1
      FP Add           3           1
      FP Multiply      6           1
      FP Div          24          24
      L:  LD    F0, 0(R1)
          stall
          ADDD  F4, F0, F2
          stall
          stall
          SD    0(R1), F4
          SUBI  R1, R1, #8
          stall
          BNEZ  R1, L
          stall

  13. Instruction Reordering Reduces Stalls for (i = 0; i < N; i++) x[i] = x[i] + s;
      Before reordering:
      L:  LD    F0, 0(R1)
          stall
          ADDD  F4, F0, F2
          stall
          stall
          SD    0(R1), F4
          SUBI  R1, R1, #8
          stall
          BNEZ  R1, L
          stall
      After reordering:
      L:  LD    F0, 0(R1)
          SUBI  R1, R1, #8
          ADDD  F4, F0, F2
          stall
          BNEZ  R1, L
          SD    8(R1), F4    ; offset adjusted since SUBI moved earlier

  14. Static scheduling limits performance We can't always reorder, and even if it were possible, there is a limit to how far the compiler can look. The ADDD instruction is stalled on the DIVide, stalling further instruction issue:
      DIV   F0, F2, F4
      ADDD  F10, F0, F8
      MULD  F12, F8, F14
      But MULD doesn't depend on ADDD or DIV. If we have an adder and a multiplier unit, one will sit idle uselessly until the DIV finishes.

  15. Dynamic scheduling Idea: modify the pipeline to enable instructions to execute as soon as their operands become available. This is known as out-of-order execution. Stalled instructions (here, the multiply) can now proceed. Program order: x = y / z; a = x + c; t = c * q; Execution order: x = y / z; t = c * q; a = x + c;

  16. Complications of dynamic scheduling Dynamically scheduled instructions also complete out of order. Increased hardware complexity: stations and execution units, bookkeeping, buffering; all of this slows down the processor. Program order: x = y / z; a = x + c; t = c * q; Execution order: x = y / z; t = c * q; a = x + c;

  17. Dynamic scheduling splits the ID stage Issue sub-stage: decode the instructions; check for structural hazards. Read-operands sub-stage: wait until there are no data hazards, then read operands. [Figure, after Mary Hall: a multithreaded instruction buffer feeding operand select and MAD/SFU units, with I$, C$/L1, shared memory, and register file.]

  18. Consequences of a split ID stage We distinguish between the time when an instruction begins execution and when it completes. Previously, an instruction stalled in the ID stage, and this held up the entire pipeline. Instructions can now be in a suspended state, neither stalling the pipeline nor executing: they are waiting on operands, and we need additional registers to store pending instructions that aren't ready to execute.

  19. The three steps of instruction execution Issue: get the instruction from the fetch queue; if there are available resources, buffer the register operands at a station; if there is no such station, stall the instruction. Execute: monitor the internal interconnect for any needed operands; intercept those operands and write them to the station; when both operands are present, execute the operation. Write result: send the result to the registers and any waiting units.

  20. Automatic, dynamic loop unrolling
      L:  LD    F0, 0(R1)
          MULTD F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, L

  21. Automatic, dynamic loop unrolling
      Instruction         Iter  Issue  Exec
      LD    F0, 0(R1)      1      X

  22. Automatic, dynamic loop unrolling
      Instruction         Iter  Issue  Exec
      LD    F0, 0(R1)      1      X      X
      MULTD F4, F0, F2     1      X

  23. Automatic, dynamic loop unrolling
      Instruction         Iter  Issue  Exec
      LD    F0, 0(R1)      1      X      X
      MULTD F4, F0, F2     1      X
      SD    0(R1), F4      1      X

  24. Automatic, dynamic loop unrolling
      Instruction         Iter  Issue  Exec
      LD    F0, 0(R1)      1      X      X
      MULTD F4, F0, F2     1      X
      SD    0(R1), F4      1      X
      LD    F0, 0(R1)      2      X

  25. Automatic, dynamic loop unrolling
      Instruction         Iter  Issue  Exec
      LD    F0, 0(R1)      1      X      X
      MULTD F4, F0, F2     1      X
      SD    0(R1), F4      1      X
      LD    F0, 0(R1)      2      X      X

  26. Automatic, dynamic loop unrolling
      Instruction         Iter  Issue  Exec
      LD    F0, 0(R1)      1      X      X
      MULTD F4, F0, F2     1      X
      SD    0(R1), F4      1      X
      LD    F0, 0(R1)      2      X      X
      MULTD F4, F0, F2     2      X
      SD    0(R1), F4      2      X

  27. Two schemes for dynamic scheduling Scoreboarding (CDC 6600); Tomasulo's algorithm (IBM 360/91). We'll vary the number of functional units, their latency, and functional unit pipelining.

  28. What is scoreboarding? A technique that allows instructions to execute out of order, so long as there are sufficient resources and no data dependencies. The goal of scoreboarding: maintain an execution rate of one instruction per clock cycle.

  29. Scoreboarding Strategy Hardware data structures keep track of: when instructions complete; which instructions depend on the results; when it's safe to write a register. Deals with data hazards: WAR (write after read), RAW (read after write), WAW (write after write). Example: a = b * c; x = y - b; q = a / y; y = x + b; [Figure: instruction fetch feeding issue & resolve, with the scoreboard controlling the functional units.]

  30. Where do RAW hazards appear? A. (2) B. (3) C. (4) D. (2) and (3) E. (3) and (4) (1) a = b * c (2) x = y - b (3) q = a / y (4) y = x + b

  31. Where do RAW hazards appear? A. (2) B. (3) C. (4) D. (2) and (3) E. (3) and (4) (1) a = b * c (2) x = y - b (3) q = a / y (4) y = x + b

  32. Where do WAR hazards appear? A. (2) B. (3) C. (4) D. (2) and (3) E. (2), (3) and (4) (1) a = b * c (2) x = y - b (3) q = a / y (4) y = x + b

  33. Where do WAR hazards appear? A. (2) B. (3) C. (4) D. (2) and (3) E. (2), (3) and (4) (1) a = b * c (2) x = y - b (3) q = a / y (4) y = x + b

  34. Hazards WAR (write after read), RAW (read after write), WAW (write after write). Example: a = b * c; x = y - b; q = a / y; y = x + b; [Figure: instruction fetch, issue & resolve, scoreboard, and functional units, as on slide 29.]

  35. Scoreboard controls the pipeline [Figure, after Mike Frank: instruction fetch, decode, and issue feed a pre-issue buffer and pre-execution buffers; the scoreboard/control unit directs read-operands, execution units 1 and 2, and write-results through post-execution buffers. WAW hazards are resolved at issue, RAW at read operands, and WAR at write results.]

  36. What are the requirements? Responsibility for instruction issue and execution, including hazard detection. Multiple instructions must be in the EX stage simultaneously, either through pipelining or multiple functional units. Our processor (DLX) has 2 multipliers, 1 divider, and 1 integer unit (memory, branch, integer arithmetic).

  37. How does it work? As each instruction passes through the scoreboard, construct a description of its data dependencies (Issue). The scoreboard determines when the instruction can read operands and begin execution. If the instruction can't begin execution, the scoreboard keeps a record and watches for the moment when the instruction can execute. It also controls when an instruction may write its result. All hazard detection is centralized.

  38. Multiple execution pipelines in DLX with scoreboarding A centralized bookkeeping table tracks instructions and the registers they depend on or modify: status of the result registers (who is going to write to a given register); status of the functional units.
      Unit          Latency
      Integer          0
      Memory           1
      FP Add           2
      FP Multiply     10
      FP Div          40
      [Figure: registers and data buses connecting the integer unit, FP adder, two FP multipliers, and FP divider, with the scoreboard's control/status lines.]

  39. Sandy Bridge [Figure: Sandy Bridge microarchitecture block diagram.]

  40. Loop Unrolling A common loop optimization strategy: duplicate the body of the loop.
      Original:
      for (int i = 0; i < n; i++)
          z[i] = x[i] + y[i];
      Unrolled by 4:
      for (int i = 0; i < n; i += 4) {
          z[i+0] = x[i+0] + y[i+0];
          z[i+1] = x[i+1] + y[i+1];
          z[i+2] = x[i+2] + y[i+2];
          z[i+3] = x[i+3] + y[i+3];
      }
      Plays well with instruction-level parallelism: register utilization, instruction scheduling. Not always advantageous. Why?
