Dynamic Scheduling in Advanced Computer Architecture

cs5100 advanced computer architecture n.w
1 / 44
Embed
Share

Explore the basic concepts of dynamic scheduling and the Tomasulo algorithm in computer architecture. Overcome data hazards, maximize ILP, and learn about the limitations of compilers in optimizing code generation for different processors. Discover the importance of out-of-order execution for efficient processor performance.

  • Computer Architecture
  • Dynamic Scheduling
  • Tomasulo Algorithm
  • ILP
  • Compiler Limitations

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. CS5100 Advanced Computer Architecture Dynamic Scheduling Prof. Chung-Ta King Department of Computer Science National Tsing Hua University, Taiwan (Slides are from textbook, Prof. Hsien-Hsin Lee, Prof. Yasun Hsu) National Tsing Hua University

  2. About This Lecture Goal: To understand the basic concepts of dynamic scheduling and the Tomasulo algorithm Outline: Overcoming data hazards with dynamic scheduling (Sec. 3.4) Dynamic scheduling: examples and algorithm (Sec. 3.5) 1 1 National Tsing Hua University

  3. Maximum ILP What make a sequence of code to have the highest ILP? What make a sequence of code to have the highest ILP? Data dependence Control dependence Can a compiler give you such code? 2 National Tsing Hua University

  4. Compiler Has Its Limitations Even though compiler can see a lot of ILP, it still outputs sequential code for conventional processors (next page) Many ILP lost in code generation Code targeting one processor may not be optimized for another with a different microarchitecture A lot of information unavailable at compile time, e.g. branch direction and target, pointer addresses, 3 National Tsing Hua University

  5. Compiler Sees Data Flow Graph (DFG) i1: r2 = 4(r22) i2: r10 = 4(r25) i3: r10 = r2 + r10 i4: 4(r26) = r10 i5: r14 = 8(r27) i6: r6 = (r22) i7: r5 = (r23) i8: r5 = r6 r5 i9: r4 = r14 * r5 i10: r15 = 12(r27) i11: r7 = 4(r22) i12: r8 = 4(r23) i13: r8 = r7 r8 i14: r8 = r15* r8 i15: r8 = r4 r8 i16: (r28) = r8 i1 i2 i5 i6 i7 i10 i11 i12 i3 i8 i13 Fired i4 i9 i14 i15 Code generation Data Flow Graph (DFG) (Data Dependency Graph) i16 4 National Tsing Hua University

  6. If Processor Just Follows the Code In-order execution: Instructions executed in the order defined by the program/compiler simple hardware A long (perhaps unexpected) latency may block ready instructions from executing: DIVD F0,F2,F4 ; multicycle instruction ADDD F10,F0,F8 ; stalled SUBD F12,F8,F14 ; independent of above It happens just because the compiler decides to put SUBD behind ADDD at code generation Need to be out-of-order execution: Processor uncovers DFG from the code sequence itself 5 National Tsing Hua University

  7. Out-of-Order Execution i1: r2 = 4(r22) i2: r10 = 4(r25) i3: r10 = r2 + r10 i4: 4(r26) = r10 i5: r14 = 8(r27) i6: r6 = (r22) i7: r5 = (r23) i8: r5 = r6 r5 i9: r4 = r14 * r5 i10: r15 = 12(r27) i11: r7 = 4(r22) i12: r8 = 4(r23) i13: r8 = r7 r8 i14: r8 = r15* r8 i15: r8 = r4 r8 i16: (r28) = r8 i1 i2 i5 i6 i7 i10 i11 i12 i3 i8 i13 i4 i9 i14 i15 i16 6 National Tsing Hua University

  8. Dynamic Scheduling Exploit ILP at run-time Execute instructions out-of-order by a restricted data flow execution model (still use PC!) Hardware will Maintain true dependency (data flow manner) Maintain exception behavior Find ILP within an instruction window (pool) Need an accurate branch predictor Hardware can also eliminate name dependency by renaming 7 National Tsing Hua University

  9. Dynamic Scheduling Pros Cope with variable latency at run time, e.g. cache misses Compiler does not need to have knowledge of microarchitecture: Avoid recompiling old binaries Avoid bottleneck of small named register sets Handle cases where dependency is unknown at compile- time Cons Hardware complexity (main argument from the VLIW/EPIC camp) Complicates exceptions 8 National Tsing Hua University

  10. Out-of-Order (OOO) Execution OOO execution out-of-order completion Begin execution as soon as operands are available Complete execution as soon as output operand generated OOO execution out-of-order retirement (commitment, write result) Machine state is not changed until instruction commits No (speculative) instructions are allowed to retire until they are confirmed to be on the right path Fetch, decode, issue (i.e. front-end) are still done in the program order 9 National Tsing Hua University

  11. Dynamic Scheduling by Tomasulo Algo. Two techniques in one: Dynamic scheduling for out-of-order execution Register renaming to avoid WAR and WAW hazards Developed by Robert Tomasulo at IBM in 1967 First implemented in the IBM System/360 Model 91 s floating point unit IBM System/360 introduced 8-bit = 1 byte 32-bit = 1 word Byte-addressable memory Differentiate an architecture from an implementation 10 National Tsing Hua University

  12. Problems with IBM 360/91 ISA 2 register specifiers/instruction in IBM 360 e.g. MULTD F2, F0 Make WAW and WAR much worse 4 FP registers in IBM 360 ISA Instructions can only see and use 4 FP registers architecture visible registers Make compiler difficult to allocate registers, e.g. need to reuse registers, creating name dependences Memory-to-register and FP operations Long and variable instruction execution time // F2 F2 F0 11 National Tsing Hua University

  13. Motivation for Tomasulo Algorithm Cope with only 4 FP registers Use more internal, architecture invisible registers (virtual registers) to break name dependences reg. renaming High FP performance without specialized compilers Hardware detects data dependences via registers used Hardware schedules instruction execution following DFG Overcome long memory and FP delays OOO execution to allow instruction execution overlapped Support execution of multiple iterations of a loop Even if loop branches can be predicted perfectly, still need to handle name dependence across iterations 12 National Tsing Hua University

  14. Key Features of Tomasulo Algorithm Each functional unit is associated with a number of reservation stations (RS) A RS controls the execution of one instruction that is going to use that FU, by tracking availability of its operands Contains the instruction, buffered operand values (when available), RS # of instruction providing the operand Instruction register specifiers are renamed with the RS tag register renaming RS copies operands to its buffer when they are available buffer+id: serves as virtual registers for reg. renaming When all operands are ready, instruction is fired to FU Hazard detection and interlocks are distributed to FUs DFG 13 National Tsing Hua University

  15. Key Features of Tomasulo Algorithm Results of FUs broadcasted directly to RSs over Common Data Bus (CDB), not through registers RS fetches and buffers an operand thru CDB as soon as it becomes available (not necessarily through register file) similar to internal forwarding/bypass Do not change machine state Register status table to track last RS to write the reg. Due to in-order issue, there is no WAW hazard Structural hazards checked at issue stage If RSs are available, then allocate one RS to issue Load and store units treated as FU with RS Integer instructions can past branches, via prediction 14 National Tsing Hua University

  16. IBM 360/91 FPU w/ Tomasulo Algorithm FP Registers (FLR) architecture visible From Mem FP operation stack (FLOS) 6 5 4 3 2 1 FP Load Buffers (FLB) Store Data Buffers (SDB) 3 2 1 2 1 Reservation Stations To Mem FP Adder FP Mult/Div Common Data Bus (CDB) 15 National Tsing Hua University

  17. 3 Stages of Tomasulo Algorithm Issue: get an instruction from instruction queue Issue if there is an empty RS Send operands to RS if in registers register renaming, structural hazard detect Execute: If operands unavailable, monitor CDB (common data bus) else, place operand into RS internal forwarding When all operands are ready, execute the instruction Loads and store maintained in program order through effective address No instruction allowed to execute until all branches that proceed it in program order have completed 16 National Tsing Hua University

  18. 3 Stages of Tomasulo Algorithm Write result: Write result on CDB, then to register (change machine state) No checking for WAW and WAR (eliminated with renaming); no need for dependent instructions to wait at register file (they wait at RS via internal forward through CDB) Load/store is treated as a functional unit Stores must wait until address and value are received 17 National Tsing Hua University

  19. Structure of Reservation Stations Each reservation station has 6 fields: Op: the operation to perform in the unit Vj, Vk: value of the source operands Qj, Qk: tag of the RS to produce source operands Busy: the RS and associated FU being busy Each register and store buffer has one field: Qi: tag of the RS containing the operation that will write to it; blank meaning register value available Load and store buffers each require a busy field Store buffer also has a field V, which holds the value to be stored to the memory 18 National Tsing Hua University

  20. Tomasulo Algorithm Loop Example Loop: LD Assume multiply takes 4 cycles Assume 1st load takes 8 cycles (cache miss?), 2nd load 4 cycles Assume branches are taken More WAW and WAR across iterations Need dynamic memory disambiguation to reorder load/store Check addresses in store buffer to detect dependences through memory F0 0 F0 0 R1 Loop R1 F2 R1 #8 p. 179 of textbook (5/e): Loop: L.D F0, 0(R1) MUL.D F4, F0, F2 S.D F4, 0(R1) DADDIU R1, R1, -8 BNE R1, R2, Loop MULTD F4 SD SUBI R1 BNEZ R1 F4 LD F0, 0(R1) MULD F4, F0, F2 SD F4, 0(R1) LD F0, 0(R1) MULD F4, F0, F2 SD F4, 0(R1) 19 National Tsing Hua University

  21. Loop Example Cycle 0 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result j k iteration Issue Busy Address No No No No No No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 Load1 Load2 Load3 Store1 Store2 Store3 Qi S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No No No Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 Register result status Clock 0 F0 F2 F4 F6 F8 F10 F12... F30 R1 80 Qi Value of register used for address and iteration control, if branches are predicted to be taken 20 National Tsing Hua University

  22. Loop Example Cycle 1 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result j k iteration Issue Busy Address Yes No No No No No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 Load1 Load2 Load3 Store1 Store2 Store3 80 LD 80: cache miss Qi S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No No No Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 Register result status Clock 1 F0 F2 F4 F6 F8 F10 F12... F30 R1 80 Qi Load1 21 National Tsing Hua University

  23. Loop Example Cycle 2 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result j k iteration Issue Busy Address Yes No No No No No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 Load1 Load2 Load3 Store1 Store2 Store3 80 Qi S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes No Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD R(F2) Load1 Register result status Clock 2 F0 F2 F4 Mult1 F6 F8 F10 F12... F30 R1 80 Original: F2 is not freed until LD and MULT both finished Tag Load1 helps track flow dependence Qi Load1 22 National Tsing Hua University

  24. Loop Example Cycle 3 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result j k iteration Issue Busy Address Yes No No Yes No No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 Load1 Load2 Load3 Store1 Store2 Store3 80 Qi Mult1 80 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes No Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD R(F2) Load1 Register result status Clock 3 F0 F2 F4 Mult1 F6 F8 F10 F12... F30 R1 80 Qi Load1 23 National Tsing Hua University

  25. Loop Example Cycle 4 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result j k iteration Issue Busy Address Yes No No Yes No No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 Load1 Load2 Load3 Store1 Store2 Store3 80 Qi Mult1 80 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes No Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD R(F2) Load1 Register result status Clock 4 F0 F2 F4 Mult1 F6 F8 F10 F12... F30 R1 72 Qi Load1 Dispatching SUBI instruction (not in FP queue) to INT FU 24 National Tsing Hua University

  26. Loop Example Cycle 5 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result j k iteration Issue Busy Address Yes No No Yes No No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 Load1 Load2 Load3 Store1 Store2 Store3 80 Qi Mult1 80 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes No Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD R(F2) Load1 Register result status Clock 5 F0 F2 F4 Mult1 F6 F8 F10 F12... F30 R1 72 Qi Load1 BNEZ (not in FP queue) with branch prediction 25 National Tsing Hua University

  27. Loop Example Cycle 6 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result j k iteration Issue Busy Address Yes Yes No Yes No No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 Load1 Load2 Load3 Store1 Store2 Store3 80 72 Qi Mult1 80 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes No Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 WAW MULTD R(F2) Load1 Register result status Clock 6 F0 F2 F4 Mult1 F6 F8 F10 F12... F30 R1 72 Qi Load2 F0 never sees Load1 result; WAW eliminated! Why does it not cause any problem? LD can be issued after checking store buffer to ensure no dependence 26 National Tsing Hua University

  28. Loop Example Cycle 7 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result j k iteration Issue Busy Address Yes Yes No Yes No No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 Load1 Load2 Load3 Store1 Store2 Store3 80 72 Qi Mult1 80 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes Yes Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 WAR across iteration? 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD MULTD R(F2) R(F2) Load1 Load2 Register result status Clock 7 F0 F2 F4 Mult2 F6 F8 F10 F12... F30 R1 72 Qi Load2 1st & 2nd iteration overlapped; Why does SD not worry about F4 being destroyed? 27 National Tsing Hua University

  29. Loop Example Cycle 8 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result j k iteration Issue Busy Address Yes Yes No Yes Yes No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 8 Load1 Load2 Load3 Store1 Store2 Store3 80 72 Qi Mult1 Mult2 80 72 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes Yes Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD MULTD R(F2) R(F2) Load1 Load2 Register result status Clock 8 F0 F2 F4 Mult2 F6 F8 F10 F12... F30 R1 72 Qi Load2 Does SD need to check load buffer to ensure no dependence? 28 National Tsing Hua University

  30. Loop Example Cycle 9 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result 9 j k iteration Issue Busy Address Yes Yes No Yes Yes No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 8 Load1 Load2 Load3 Store1 Store2 Store3 80 72 Qi Mult1 Mult2 80 72 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes Yes Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD MULTD R(F2) R(F2) Load1 Load2 Register result status Clock 9 F0 F2 F4 Mult2 F6 F8 F10 F12... F30 R1 64 Qi Load2 Load1 completing: what is waiting for it? Issuing 2nd SUBI 29 National Tsing Hua University

  31. Loop Example Cycle 10 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result 9 j k iteration Issue Busy Address No Yes No Yes Yes No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 8 10 Load1 Load2 Load3 Store1 Store2 Store3 72 Qi Mult1 Mult2 10 80 72 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes Yes Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 4 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop Got value from CDB F4 MULTD MULTD M(80) R(F2) R(F2) Load2 Register result status Clock 10 F0 F2 F4 Mult2 F6 F8 F10 F12... F30 R1 64 Qi Load2 Load2 completing: what is waiting for it? Issuing 2nd BNEZ 30 National Tsing Hua University

  32. Loop Example Cycle 11 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result 9 j k iteration Issue Busy Address No No Yes Yes Yes No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 8 10 Load1 Load2 Load3 Store1 Store2 Store3 64 80 72 Qi Mult1 Mult2 10 11 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes Yes Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 3 Mult1 4 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD MULTD M(80) R(F2) M(72) R(F2) Register result status Clock 11 F0 F2 F4 Mult2 F6 F8 F10 F12... F30 R1 64 Qi Load3 Next load in 3rd iteration after checking store buffer 31 National Tsing Hua University

  33. Loop Example Cycle 12 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result 9 j k iteration Issue Busy Address No No Yes Yes Yes No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 8 10 Load1 Load2 Load3 Store1 Store2 Store3 64 80 72 Qi Mult1 Mult2 10 11 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes Yes Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 2 Mult1 3 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD MULTD M(80) R(F2) M(72) R(F2) stall Register result status Clock 12 F0 F2 F4 Mult2 F6 F8 F10 F12... F30 R1 64 Qi Load3 Why not issue third multiply? 32 National Tsing Hua University

  34. Loop Example Cycle 13 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result 9 j k iteration Issue Busy Address No No Yes Yes Yes No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 8 10 Load1 Load2 Load3 Store1 Store2 Store3 64 80 72 Qi Mult1 Mult2 10 11 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes Yes Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 1 Mult1 2 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD MULTD M(80) R(F2) M(72) R(F2) Register result status Clock 13 F0 F2 F4 Mult2 F6 F8 F10 F12... F30 R1 64 Qi Load3 Why not issue third store? 33 National Tsing Hua University

  35. Loop Example Cycle 14 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result 9 14 j k iteration Issue Busy Address No No Yes Yes Yes No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 8 10 Load1 Load2 Load3 Store1 Store2 Store3 64 80 72 Qi Mult1 Mult2 10 11 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes Yes Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 1 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD MULTD M(80) R(F2) M(72) R(F2) Register result status Clock 14 F0 F2 F4 Mult2 F6 F8 F10 F12... F30 R1 64 Qi Load3 Mult1 completing: what is waiting for it? 34 National Tsing Hua University

  36. Loop Example Cycle 15 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result 9 14 j k iteration Issue Busy Address No No Yes Yes Yes No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 8 10 15 Load1 Load2 Load3 Store1 Store2 Store3 64 80 72 Qi M(80)*R(F2) Mult2 10 15 11 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No No Yes Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop via CDB F4 MULTD M(72) R(F2) Register result status Clock 15 F0 F2 F4 Mult2 F6 F8 F10 F12... F30 R1 64 Qi Load3 Mult2 completing; what is waiting for it? 35 National Tsing Hua University

  37. Loop Example Cycle 16 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result 9 14 j k iteration Issue Busy Address No No Yes Yes Yes No 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 8 10 15 Load1 Load2 Load3 Store1 Store2 Store3 64 80 72 Qi M(80)*R(F2) M(72)*R(72) 10 15 11 16 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes No Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop via CDB F4 MULTD R(F2) Load3 (3rd multiply) Register result status Clock 16 F0 F2 F4 Mult1 F6 F8 F10 F12... F30 R1 64 Qi Load3 36 National Tsing Hua University

  38. Loop Example Cycle 17 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result 9 14 j k iteration Issue Busy Address No No Yes Yes Yes Yes 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 8 10 15 Load1 Load2 Load3 Store1 Store2 Store3 64 80 72 64 Qi M(80)*R(F2) M(72)*R(72) Mult1 10 15 11 16 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes No Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD R(F2) Load3 Register result status Clock 17 F0 F2 F4 Mult1 F6 F8 F10 F12... F30 R1 64 Qi Load3 37 National Tsing Hua University

  39. Loop Example Cycle 18 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result 9 14 18 10 15 j k iteration Issue Busy Address No No Yes Yes Yes Yes 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 8 10 15 Load1 Load2 Load3 Store1 Store2 Store3 64 80 72 64 Qi M(80)*R(F2) M(72)*R(72) Mult1 11 16 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes No Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD R(F2) Load3 Register result status Clock 18 F0 F2 F4 Mult1 F6 F8 F10 F12... F30 R1 56 Qi Load3 38 National Tsing Hua University

  40. Loop Example Cycle 19 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result 9 14 18 10 15 19 j k iteration Issue Busy Address No No Yes No Yes Yes 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 8 10 15 19 11 16 Load1 Load2 Load3 Store1 Store2 Store3 64 Qi 72 64 M(72)*R(72) Mult1 S1 Vj S2 Vk RS for j RS for k Qj Busy Op No No No Yes No Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD R(F2) Load3 Register result status Clock 19 F0 F2 F4 Mult1 F6 F8 F10 F12... F30 R1 56 Qi Load3 39 National Tsing Hua University

  41. In-order issue, OOO execution, completion, commitment Loop Example Cycle 20 Instruction status Instruction LD F0 MULTD F4 SD F4 LD F0 MULTD F4 SD F4 Reservation Stations Time Name ExecutionWrite complete Result 9 14 18 10 15 20 S2 Vk j k iteration Issue Busy Address No No Yes No No Yes 0 R1 F0 F2 0 R1 0 R1 F0 F2 0 R1 1 1 1 2 2 2 1 2 3 6 7 8 10 15 19 11 16 21 20 Load1 Load2 Load3 Store1 Store2 Store3 64 Qi No WAR 64 Mult1 19 S1 Vj RS for j RS for k Qj Busy Op No No No Yes No Qk Code: LD MULTD F4 SD SUBI R1 BNEZ R1 0 Add1 0 Add2 0 Add3 0 Mult1 0 Mult2 F0 0 R1 F0 F2 0 R1 R1 #8 Loop F4 MULTD R(F2) Load3 Register result status Clock 21 20 F0 F2 F4 Mult1 F6 F8 F10 F12... F30 R1 56 Qi Load3 40 National Tsing Hua University

  42. Dependences through Memory in LD/ST 2 step process for both LD & ST: 1st step: Calculate effective address and place into separate L or S buffers in program order 2nd step: Access memory unit and the rest ST can update memory when it reaches write-result stage and whenever data is available Order between ST and LD can be OOO and cause hazard Hazard detection (dynamic mem. disambiguation) LD: Check its address with addresses in store buffers; if match, delay sending LD to load buffer until store is done ST: Same as LD but check both load and store buffers (to avoid that the 2nd store may move ahead of the 1st store if they are to the same address) 41 National Tsing Hua University

  43. Load Bypassing and Load Forwarding Bypassing: when load address does not match addresses of preceding stores, load is allowed to move ahead of these stores in the store buffer Forwarding: If load address matches address of a store, to-be-stored data can be forwarded to load (RAW) If multiple preceding stores in store buffer that alias with the load, must determine which store is the most recent To avoid port contention, an additional read port is needed in the store buffer to forward data to load. The original read port is used to transfer data to data cache 42 National Tsing Hua University

  44. Summary of Tomasulo Algorithm Distributed hazard detect and execute control: Distributed RSs; depends on RS availability not FU CDB broadcasts operands and releases multiple pending instructions (through associative tag matching) Internal forwarding without going through registers Eliminate WAR and WAW no need to check Register renaming through RSs Copy operands into RS when available no WAR Last of successive writes actually write to reg. no WAW Load and store units are treated as FUs Build data flow graph on the fly Complex H/W for control, associative store, BCD 43 National Tsing Hua University

More Related Content