Dynamic Schedule Migration for Heterogeneous Cores

"Explore DynaMOS, a system designed for dynamic schedule migration among heterogeneous cores, optimizing performance while reducing wasteful work on expensive out-of-order hardware. With a focus on hardware heterogeneity and fine-grained architectures, DynaMOS achieves near out-of-order performance on in-order hardware by leveraging program traces and memoization. Discover how this approach can potentially execute 80% of applications on in-order cores, reaching 95% of out-of-order core performance." (Characters: 499)

  • Heterogeneous Cores
  • Schedule Migration
  • Performance Optimization
  • Hardware Heterogeneity
  • Fine-grained Architectures


Presentation Transcript


  1. DynaMOS: Dynamic Schedule Migration for Heterogeneous Cores. Shruti Padmanabha, Andrew Lukefahr, Reetuparna Das, Scott Mahlke. MICRO-48, Waikiki, Hawaii, December 2015. University of Michigan, Electrical Engineering and Computer Science.

  2. 2-wide out-of-order (OoO) execution vs. 2-wide in-order (InO) execution of the same program-order dependency graph. Reordering instructions buys roughly 2x performance, but the reordering hardware costs roughly 6x the power.
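
To make the reordering point concrete, here is a toy Python sketch that counts issue cycles with and without reordering. The dependency graph, single-cycle latencies, and blocking in-order issue are illustrative assumptions, not the exact example on the slide.

```python
# Toy model: how many cycles a 2-wide machine needs to issue six instructions,
# in strict program order vs. with reordering across independent chains.

DEPS = {1: set(), 2: {1}, 3: {2},      # chain 1 -> 2 -> 3
        4: set(), 5: {4}, 6: {5}}      # independent chain 4 -> 5 -> 6

def cycles(program, width=2, reorder=False):
    done, pending, cycle = set(), list(program), 0
    while pending:
        cycle += 1
        issued = []
        for instr in list(pending):
            if len(issued) == width:
                break
            if DEPS[instr] <= done:
                issued.append(instr)
                pending.remove(instr)
            elif not reorder:
                break                  # in-order issue stalls at the first blocked instruction
        done |= set(issued)
    return cycle

print("in-order cycles:", cycles([1, 2, 3, 4, 5, 6]))                # serialized by the chains
print("reordered cycles:", cycles([1, 2, 3, 4, 5, 6], reorder=True)) # pairs 1+4, 2+5, 3+6
```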

  3. Redundancy on OoO: code repeats, so the OoO core redundantly re-creates the same optimal reordered schedule for recurring program traces. There is a 90% probability of creating similar schedules for 70% of traces.

  4. Objective: expose and eliminate wasteful work on the expensive OoO hardware without significantly hurting performance.

  5. Background: Heterogeneity in Hardware. Many hardware designs of varying capabilities sit on the same chip (OoOs, in-orders, accelerators, FPGAs), and the most efficient hardware is chosen for each application. Examples: ARM's big.LITTLE, Nvidia's Tegra 3, Intel Xeon+FPGA, AMD Fusion.

  6. Background: Fine-grained Heterogeneous Architectures. An OoO backend and an InO backend, each with its own register file, share a frontend, L1 cache, L2 cache, and a migration controller (Composite Cores: Pushing Heterogeneity into a Core, Lukefahr et al., MICRO 2012). Sharing minimizes transfer overhead and allows application migration at the granularity of 100s of instructions.

  7. DynaMOS: Dynamic Schedule Migration for Heterogeneous Cores. Program traces scheduled on the OoO backend are memoized in a trace cache (Trace $) and replayed on the InO backend, achieving near-OoO performance with near-InO hardware.
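
As a rough sketch of the flow on this slide, assuming a hypothetical TraceCache interface and trace naming (none of these identifiers come from the paper):

```python
# Minimal sketch: schedules produced by the OoO backend are memoized in a small
# trace cache and replayed on the in-order backend when the same trace recurs.

class TraceCache:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.schedules = {}                     # trace id -> memoized OoO schedule

    def lookup(self, trace_id):
        return self.schedules.get(trace_id)

    def memoize(self, trace_id, schedule):
        if len(self.schedules) < self.capacity:
            self.schedules[trace_id] = schedule

def run_trace(trace_id, trace_cache):
    schedule = trace_cache.lookup(trace_id)
    if schedule is not None:
        # Hot, repeatable trace: replay the recorded schedule on the in-order backend.
        return f"replay {trace_id} on InO backend: {schedule}"
    # Cold trace: execute on the OoO backend, which also produces the reordered
    # schedule that can be memoized for next time.
    schedule = ("i1", "i3", "i2")               # stand-in for the OoO's reordering
    trace_cache.memoize(trace_id, schedule)
    return f"executed {trace_id} on OoO backend, schedule memoized"

tc = TraceCache()
print(run_trace("loop@0x4004f0", tc))           # cold: runs out-of-order
print(run_trace("loop@0x4004f0", tc))           # hot: replays on the little core
```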

  8. Motivation (Oracle): percentage of execution on the InO core, with and without memoization, with performance loss capped at 5%. DynaMOS can potentially execute 80% of the application on an InO core while achieving 95% of an OoO core's performance. Memoization works for regular benchmarks with predictable control/data flow; benchmarks with unpredictable control/data flow are not memoizable.

  9. DynaMOS Challenges, 1: trace generation and selection. Detect profitable traces to memoize using an intelligent trace-based predictor: determine a trace boundary, find repeatability in schedules, and determine the profitability of memoizing a trace. Selected traces flow from the OoO backend and L1 I$ into the Trace $ that feeds the InO backend.
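
A minimal sketch of what such a selection heuristic could look like, with a hypothetical repeat-count threshold, schedule representation, and trace naming; this is only an illustration, not the paper's actual predictor.

```python
# Count how often the OoO core produces the same schedule for a trace; once the
# trace looks hot and repeatable, its schedule is worth memoizing.

from collections import defaultdict

REPEAT_THRESHOLD = 8                   # assumed value, for illustration only

schedule_history = defaultdict(lambda: {"last": None, "repeats": 0})

def observe(trace_id, schedule):
    """Record one OoO execution of a trace; return True when memoizing
    this trace's schedule looks profitable."""
    entry = schedule_history[trace_id]
    if entry["last"] == schedule:
        entry["repeats"] += 1          # same reordered schedule as last time
    else:
        entry["last"], entry["repeats"] = schedule, 0
    return entry["repeats"] >= REPEAT_THRESHOLD

# A hot inner-loop trace that keeps producing the same schedule becomes a
# memoization candidate after a few iterations.
for i in range(10):
    if observe("loop@0x4004f0", ("i3", "i1", "i4", "i2")):
        print(f"iteration {i}: memoize schedule for loop@0x4004f0")
        break
```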

  10. DynaMOS Challenges, 2: OinO hardware. Beyond detecting profitable traces with the trace-based predictor (Challenge 1), guarantee correct execution of the reordered schedule on the InO backend, the OinO mode.

  11. Designing the OinO Mode. Each correctness factor handled by OoO structures needs an OinO counterpart:
  • (a) False register dependencies: register renaming on OoO; 2-level renaming in OinO.
  • (b) Memory disambiguation detection: load-store queue on OoO; specialized LSQ in OinO.
  • Divergence from predicted behavior or interrupts: reorder buffer and register alias table on OoO; atomic trace commit in OinO.
  The OinO mode adds a PRF, an LSQ, and trace-commit logic around the InO core's fetch, decode, and back-end stages; a trace commits only once it completes.

  12. Handling False Dependencies (a). Original assembly in program order: 1 ldr r2, [r2]; 2 add r5, r2, #4; 3 ldr r2, [r3]. Instruction 2 has a true (RAW) dependency on 1, while 3 has only a false (WAW) dependency on 1, so the OoO reorders the independent instructions: 1 ldr r2, [r2]; 3 ldr r2, [r3]; 2 add r5, r2, #4. Level 1, intra-trace dependencies, is resolved on the OoO when the trace is generated: the memoized trace on OinO accesses indexed physical register locations, ldr r2.1, [r2.0]; ldr r2.2, [r3.0]; add r5.1, r2.1, #4. Constraint: only 4 physical registers per architectural register. Overhead: a bigger PRF on the InO core.
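
A minimal Python sketch of the Level-1 renaming described on this slide, using the slide's own three-instruction example; the tuple encoding and the rename_trace helper are illustrative, not the hardware mechanism.

```python
# Every new write to an architectural register inside a trace gets the next
# indexed slot (r2.1, r2.2, ...), removing WAW/WAR false dependencies; version
# 0 names the value live at trace entry. The 4-slots-per-register limit mirrors
# the constraint on the slide.

SLOTS_PER_REG = 4

def rename_trace(trace):
    version = {}                      # arch reg -> latest written version
    renamed = []
    for dst, srcs in trace:           # each instruction: (dest reg, [source regs])
        # Sources read the most recent version of each architectural register.
        read = [f"{r}.{version.get(r, 0)}" for r in srcs]
        # The destination gets a fresh indexed slot; a trace needing more than
        # SLOTS_PER_REG versions of one register would not be memoized.
        v = version.get(dst, 0) + 1
        assert v < SLOTS_PER_REG, "trace exceeds the per-register slot budget"
        version[dst] = v
        renamed.append((f"{dst}.{v}", read))
    return renamed

# The slide's example in program order: ldr r2,[r2]; add r5,r2,#4; ldr r2,[r3]
trace = [("r2", ["r2"]), ("r5", ["r2"]), ("r2", ["r3"])]
for dst, srcs in rename_trace(trace):
    print(dst, "<-", srcs)
# prints r2.1 <- ['r2.0'], r5.1 <- ['r2.1'], r2.2 <- ['r3.0'], matching the slide
```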

  13. Handling False Dependencies (a), Level 2: inter-trace dependencies, resolved by the OinO with a rotating physical register file. A committed offset pointer rotates the mapping from memoized register versions to physical slots each time a trace iteration commits, so values produced by one iteration remain visible to the next. For the example trace (ldr r2.1, [r2.0]; ldr r2.2, [r3.0]; add r5.1, r2.1, #4), iteration I maps r2.0-r2.3 to physical slots 0-3, while iteration II maps them to slots 2, 3, 0, 1.
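
A minimal sketch of the rotating mapping, assuming the trace writes two new versions of r2 per iteration as in the slide's example; the slot layout and helper names are illustrative.

```python
# The trace is renamed once; each committed iteration advances an offset so the
# versions written last iteration become the ".0"-style live-in values of the next.

SLOTS = 4                               # physical slots per architectural register
WRITES_PER_ITER = 2                     # the example trace writes r2 twice (r2.1, r2.2)

def physical_slot(version, offset):
    # Map a memoized version index (r2.<version>) to a physical slot, rotated by
    # the committed offset pointer.
    return (version + offset) % SLOTS

offset = 0
for iteration in ("I", "II", "III"):
    mapping = {f"r2.{v}": f"slot {physical_slot(v, offset)}" for v in range(SLOTS)}
    print(f"iteration {iteration}: {mapping}")
    offset = (offset + WRITES_PER_ITER) % SLOTS   # advance when the iteration commits
```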

  14. Handling Memory Disambiguation (b). (i) A trace containing loads and stores is selected by the OoO. (ii) Each memory op's program-order position (Seq #) is encoded with the trace. (iii) The trace is memoized in the Tr$ in its reordered schedule; in the example, the program-order sequence is Str 1, Ld 1, Str 2, Ld 2, Ld 3 (Seq # 0-4). (iv) In OinO mode, LSQ entries are allocated in Seq # order. (v) Each memory op checks younger memory ops for aliasing. Overhead: an LSQ structure and a Seq # table per trace.
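
A minimal sketch of the aliasing check, with made-up sequence numbers, addresses, and a simplified squash policy; this illustrates the idea, not the paper's LSQ design.

```python
# Each memory op in the memoized (reordered) schedule carries its original
# program-order Seq #. When an op executes, it checks whether a younger op that
# already ran aliases with it; if so, the memoized order was wrong for this run.

def aliases_with_younger(executed_ops, seq_no, kind, addr):
    """True if a younger (later in program order) op that already executed
    touches the same address and at least one of the pair is a store."""
    for other_seq, other_kind, other_addr in executed_ops:
        if other_seq > seq_no and other_addr == addr and "store" in (kind, other_kind):
            return True
    return False

def replay(trace_mem_ops):
    executed = []                          # conceptually, LSQ entries in Seq # order
    for seq_no, kind, addr in trace_mem_ops:       # memoized (reordered) order
        if aliases_with_younger(executed, seq_no, kind, addr):
            return "squash: fall back to the OoO backend for this trace"
        executed.append((seq_no, kind, addr))
    return "trace committed atomically"

# Toy run: the memoized order hoisted Ld (Seq 3) above Str (Seq 2); that is only
# a problem when they touch the same address.
print(replay([(1, "load", 0x10), (3, "load", 0x20), (2, "store", 0x30), (4, "load", 0x40)]))
print(replay([(1, "load", 0x10), (3, "load", 0x20), (2, "store", 0x20), (4, "load", 0x40)]))
```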

  15. Evaluation Methodology.
  • Big core: 3-wide out-of-order @ 2 GHz, 12-stage pipeline, 128 ROB entries, 128-entry PRF, 32-entry LSQ.
  • Little core: 3-wide in-order @ 2 GHz, 8-stage pipeline, 128-entry PRF, 32-entry LSQ.
  • Memory system: 32 KB L1 i/d caches (2-cycle access), 4 KB trace cache (1-cycle access), 1 MB L2 cache (15-cycle access), 1 GB main memory (80-cycle access).
  • Simulator: gem5. Energy model: McPAT.
  • Overheads: the added hardware increases InO core power by 8%; the 4 KB Trace $ adds 10% to leakage energy.
  • Benchmarks: SPEC CPU2006 compiled for the ARM ISA, simulated for a total of 108 SimPoints of 300M instructions each.

  16. Utilization of the Little Core. For each benchmark, bar 1 is Composite Cores (no memoization, split into %InO and %OoO) and bar 2 is DynaMOS (with memoization, including %OinO), with performance loss capped at 5%. Worst-case DynaMOS performance is comparable to Composite Cores, and the Little core executes both low-performance traces and traces with high memoizability.

  17. Energy Savings. Per-benchmark energy savings of DynaMOS relative to the OoO core (32% on average, as reported in the Summary).

  18. Additional Results in the Paper. Sensitivity studies on different microarchitecture configurations of the OoO and InO cores: equal widths in both cores allow the simplest memoization and give the best results. Comparison studies against loop caches and execution caches: switching over to the InO core saves the most energy. Sensitivity studies on the size of the trace cache and the various other constraints imposed in OinO.

  19. Summary. Out-of-order cores create similar schedules for repeating code, a wasteful use of expensive resources. DynaMOS exploits fine-grained heterogeneity to share OoO schedules with InO cores, allowing 32% energy savings over an OoO-only core at a 5% performance loss. More details and comparisons to related work are in the paper.

  20. DynaMOS: Dynamic Schedule Migration for Heterogeneous Cores. MAHALO! QUESTIONS? Shruti Padmanabha, Andrew Lukefahr, Reetuparna Das, Scott Mahlke. MICRO-48, Waikiki, Hawaii. University of Michigan, Electrical Engineering and Computer Science.
