Optimizing Power Efficiency in Many-core Processors with Voltage Scaling

Explore the optimization of total power in many-core processors by considering voltage scaling limits and process variations. Learn about supply voltage scaling, power scaling impacts, process variations, and more to improve performance-power efficiency in multicore processors.

Uploaded on May 28, 2025 | 0 Views

Presentation Transcript

Optimizing Total Power of Many-core Processors Considering Voltage Scaling Limit and Process Variations
Jungseob Lee and Nam Sung Kim
October 9, 2009
Department of Electrical and Computer Engineering, University of Wisconsin - Madison

Outline
- Introduction
- Supply Voltage and Power Scaling
  - Supply Voltage Scaling of Many-Core Processors
  - Power Scaling of Many-Core Processors
- Impacts of Within-Die (WID) Spatial Process Variations
  - Global Clocking
  - Frequency Island Clocking
- Conclusions

Multicore processors
- Parallel processing: more cores improve the throughput of computing systems.
- With all cores running, throughput is limited by power and thermal constraints.
- Challenges:
  - How do we determine the number of cores for the best performance-power efficiency?
  - How do we exploit process variations in multicore processors?
[Figures: a GPU, which has many cores [2]; parallel processing vs. serial processing [1]]
[1] Source: http://www.interactivesupercomputing.com/starpexpress/042007/3_Task_Parallel.html
[2] Source: NVIDIA

Process variations
- Types of process variations: die-to-die (D2D) variations and within-die (WID) variations.
- C2C frequency and leakage power variations due to spatially correlated WID variations become considerable.
[Figures: wafer-scale variations; a systematic Vth variation map for a 16-core processor; the corresponding normalized Fmax and Pleak map]
Courtesy: K. Bowman from Intel

Supply Voltage Scaling (1): supply voltage scaling of many-core processors
- Throughput with a certain number of cores at maximum VDD (thus Fmax) = throughput with more cores at a lower VDD (thus a lower Fmax).
- The potential throughput increase from many cores can be traded for a lower VDD to reduce power.
[Figure: 4 cores operating at VDD compared with 8 cores operating at a voltage V lower than VDD, at a lower operating frequency]

Supply Voltage Scaling (2): supply voltage scaling of many-core processors

$M \cdot T_{cycle}(V_{DD}) = M \cdot \left((1 - F) + F/N\right) \cdot T_{cycle}(V)$

- $M$: number of operations
- $T_{cycle}$: cycle time of a processor at a given supply voltage
- $V_{DD}$: nominal supply voltage of the base core processor
- $F$: fraction of operations parallelizable without overhead
- $N$: relative number of cores
- $V$: scaled supply voltage of $N\times$ more cores

[Figures: scaled supply voltage for PTM 32nm LP and PTM 32nm HP. Annotations: LP requires a higher VDD due to its high Vth; "> 40 %"]
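The equal-throughput condition above reduces to a required frequency factor, Fmax(V)/Fmax(VDD) = (1 - F) + F/N, which can then be inverted for V. The following is a minimal sketch of that inversion, not the authors' method: the slides obtain the frequency factor from HSPICE simulations, whereas this sketch assumes an alpha-power-law model, and the values of `Vth`, `alpha`, `VDD`, and `Vmin` are hypothetical placeholders.

```python
def required_freq_factor(F, N):
    """(1 - F) + F/N: fraction of the base Fmax needed for equal throughput."""
    return (1.0 - F) + F / N


def freq_factor(V, VDD=1.0, Vth=0.3, alpha=1.3):
    """Fmax(V) / Fmax(VDD) under an ASSUMED alpha-power-law model.

    Vth and alpha are hypothetical values, not taken from the slides.
    """
    def fmax(v):
        return (v - Vth) ** alpha / v
    return fmax(V) / fmax(VDD)


def scaled_voltage(F, N, VDD=1.0, Vmin=0.7, Vth=0.3, alpha=1.3):
    """Lowest V in [Vmin, VDD] that still meets the equal-throughput condition."""
    target = required_freq_factor(F, N)
    # If the cores are already fast enough at the voltage floor, stop there.
    if freq_factor(Vmin, VDD, Vth, alpha) >= target:
        return Vmin
    # Otherwise bisect; the assumed model is monotonically increasing in V.
    lo, hi = Vmin, VDD
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if freq_factor(mid, VDD, Vth, alpha) < target:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, round(scaled_voltage(F=0.9, N=n), 3))
```

Clamping at `Vmin` mirrors the voltage scaling limit discussed in the talk: once the floor is reached, adding cores no longer lowers V.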

Dynamic Power Analysis (1): dynamic power scaling

Dynamic power of a base many-core processor:

$P_{dyn,base} = C_{eff} \cdot V_{DD}^2 \cdot F_{max}(V_{DD})$

Dynamic power of $N\times$ more cores than the base processor:

$P_{dyn,N} = \left((1 - F)(1 + (N - 1)K) + F \cdot N\right) \cdot C_{eff} \cdot V^2 \cdot F_{max}(V) = k(F, K, N) \cdot f(V) \cdot (V/V_{DD})^2 \cdot P_{dyn,base}$

- $P_{dyn,base}$: dynamic power of a base processor
- $C_{eff}$: effective total switching capacitance
- $V_{DD}$: nominal voltage of the base core
- $F_{max}$: maximum operating frequency of the base processor
- $P_{dyn,N}$: dynamic power of $N\times$ more cores
- $K$: fraction of dynamic power of idle cores
- $k(F, K, N) = (1 - F)(1 + (N - 1)K) + F \cdot N$
- $f(V)$: frequency scaling factor at $V$; $F_{max}(V)/F_{max}(V_{DD})$
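As a small worked illustration of the dynamic power scaling equation, the sketch below evaluates k(F, K, N) and the normalized ratio Pdyn,N / Pdyn,base. The function names and the numeric inputs in the usage lines are illustrative assumptions; in the talk, f(V) comes from circuit simulation rather than being supplied by hand.

```python
def k_factor(F, K, N):
    """k(F, K, N) = (1 - F)(1 + (N - 1)K) + F*N."""
    return (1.0 - F) * (1.0 + (N - 1) * K) + F * N


def normalized_pdyn(F, K, N, f_V, v_ratio):
    """Pdyn,N / Pdyn,base = k(F, K, N) * f(V) * (V/VDD)^2.

    f_V     : frequency scaling factor Fmax(V)/Fmax(VDD)
    v_ratio : V/VDD
    """
    return k_factor(F, K, N) * f_V * v_ratio ** 2


if __name__ == "__main__":
    # Hypothetical inputs: fully parallel workload, 2x cores, half frequency, 0.7*VDD.
    print(normalized_pdyn(F=1.0, K=0.1, N=2, f_V=0.5, v_ratio=0.7))
```

With N = 1, f(V) = 1 and V = VDD the ratio is 1 by construction, which is a quick sanity check on any implementation of the formula.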

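Dynamic Power Analysis (2): dynamic power scaling
[Figures: normalized Pdyn for PTM 32nm LP and PTM 32nm HP with VDD,min = 0.7 V. Dotted lines show the projected power consumption when there is no supply voltage limit. Annotation: less VDD scaling, less Pdyn reduction.]

Optimal normalized Pdyn / relative number of cores:

| Model | VDD,min | F=0.6 | F=0.7 | F=0.8 | F=0.9 | F=1.0 |
|---|---|---|---|---|---|---|
| PTM HP | 0.7 | 0.75/2 | 0.66/3 | 0.60/2 | 0.52/2 | 0.45/2 |
| PTM HP | 0.6 | 0.75/2 | 0.66/3 | 0.54/3 | 0.41/4 | 0.34/3 |
| PTM HP | No limit | 0.75/2 | 0.66/3 | 0.54/3 | 0.41/5 | 0.20/8 |
| PTM LP | 0.7 | 0.75/2 | 0.70/2 | 0.65/2 | 0.56/3 | 0.46/3 |
| PTM LP | 0.6 | 0.75/2 | 0.70/2 | 0.65/2 | 0.55/4 | 0.35/8 |
| PTM LP | No limit | 0.75/2 | 0.70/2 | 0.65/2 | 0.55/4 | 0.35/8 |

Pdyn reduction with VDD,min = 0.7 V: HP 25~55%, LP 25~54%.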

Leakage Power Analysis (1): leakage power scaling

In nanoscale technology, leakage power is a significant fraction of total power consumption.

Leakage power of a base many-core processor:

$P_{leak,base} = I_{leak}(V_{DD}) \cdot V_{DD}$

Leakage power of $N\times$ more cores than the base processor:

$P_{leak,N} = N \cdot I_{leak}(V) \cdot V = N \cdot l(V) \cdot (V/V_{DD}) \cdot P_{leak,base}$

- $P_{leak,base}$: leakage power of a base core
- $I_{leak}$: total leakage current of the base processor
- $V_{DD}$: nominal voltage of the base core
- $P_{leak,N}$: leakage power of $N\times$ more cores
- $l(V)$: leakage scaling factor at $V$

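Leakage Power Analysis (2): leakage power scaling
[Figures: normalized Pleak for PTM 32nm LP and PTM 32nm HP. Annotation on LP: but its absolute Pleak is much less than that of HP.]

Optimal normalized Pleak / relative number of cores:

| Model | VDD,min | F=0.6 | F=0.7 | F=0.8 | F=0.9 | F=1.0 |
|---|---|---|---|---|---|---|
| PTM HP | 0.7 | 0.46/3 | 0.35/3 | 0.31/2 | 0.25/2 | 0.20/2 |
| PTM HP | 0.6 | 0.46/3 | 0.35/3 | 0.27/3 | 0.21/4 | 0.16/3 |
| PTM HP | No limit | 0.46/3 | 0.35/3 | 0.27/3 | 0.21/4 | 0.15/5 |
| PTM LP | 0.7 | 0.67/2 | 0.62/2 | 0.58/2 | 0.54/2 | 0.50/2 |
| PTM LP | 0.6 | 0.67/2 | 0.62/2 | 0.58/2 | 0.54/2 | 0.50/2 |
| PTM LP | No limit | 0.67/2 | 0.62/2 | 0.58/2 | 0.54/2 | 0.50/2 |

Pleak reduction with VDD,min = 0.7 V: HP 54~80%, LP 33~50%.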

Total Power Analysis (1): total power scaling

The total power of a base many-core processor is the sum of its dynamic and leakage power:

$P_{tot,base} = P_{dyn,base} + P_{leak,base} = P_{dyn,base} \cdot (1 + L_F)$

The total power of $N\times$ more cores than the base processor is likewise the sum of dynamic and leakage power:

$P_{tot,N} = P_{dyn,N} + P_{leak,N} = P_{dyn,base} \cdot \left\{ k(F, K, N) \cdot f(V) \cdot (V/V_{DD})^2 + N \cdot l(V) \cdot (V/V_{DD}) \cdot L_F \right\}$

- $P_{tot,base}$: total power of a base core
- $L_F$: ratio between $P_{leak}$ and $P_{dyn}$; $(P_{leak}/P_{dyn})$
- $P_{tot,N}$: total power of $N\times$ more cores
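Dividing the two total power expressions gives a normalized figure of merit, Ptot,N / Ptot,base, that does not depend on Pdyn,base. The sketch below evaluates that ratio; it is an illustration of the formula only. The function names are hypothetical, and the scaling factors f(V) and l(V) are passed in as plain numbers because the talk derives them from simulation.

```python
def k_factor(F, K, N):
    """k(F, K, N) = (1 - F)(1 + (N - 1)K) + F*N."""
    return (1.0 - F) * (1.0 + (N - 1) * K) + F * N


def normalized_ptot(F, K, N, LF, f_V, l_V, v_ratio):
    """Ptot,N / Ptot,base for N x more cores at scaled voltage V.

    LF      : Pleak,base / Pdyn,base of the base processor
    f_V     : frequency scaling factor Fmax(V)/Fmax(VDD)
    l_V     : leakage scaling factor at V
    v_ratio : V/VDD
    """
    dyn = k_factor(F, K, N) * f_V * v_ratio ** 2   # Pdyn,N / Pdyn,base
    leak = N * l_V * v_ratio * LF                  # Pleak,N / Pdyn,base
    return (dyn + leak) / (1.0 + LF)


if __name__ == "__main__":
    # Hypothetical inputs; LF = 0.4/0.6 is the HP ratio quoted in the talk.
    print(normalized_ptot(F=0.9, K=0.1, N=2, LF=0.4 / 0.6,
                          f_V=0.55, l_V=0.5, v_ratio=0.7))
```

Because leakage grows with N while dynamic power falls with V, sweeping N with this function is one way to look for a minimum, which is the trade-off the talk explores.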

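Total Power Analysis (2): total power scaling
[Figures: normalized Ptot for PTM 32nm LP (LF = 0.2/0.8) and PTM 32nm HP (LF = 0.4/0.6). Annotation: more VDD scaling gives only 17% more Ptot reduction, but requires more on-die memory area.]

Optimal normalized Ptot / relative number of cores:

| Model | LF | VDD,min | F=0.6 | F=0.7 | F=0.8 | F=0.9 | F=1.0 |
|---|---|---|---|---|---|---|---|
| PTM HP | 0.4/0.6 | 0.7 | 0.64/2 | 0.53/3 | 0.48/2 | 0.41/2 | 0.35/2 |
| PTM HP | 0.4/0.6 | 0.6 | 0.64/2 | 0.53/3 | 0.43/3 | 0.33/4 | 0.27/3 |
| PTM HP | 0.4/0.6 | No limit | 0.64/2 | 0.53/3 | 0.43/3 | 0.33/5 | 0.18/8 |
| PTM LP | 0.2/0.8 | 0.7 | 0.74/2 | 0.69/2 | 0.63/2 | 0.57/3 | 0.48/3 |
| PTM LP | 0.2/0.8 | 0.6 | 0.74/2 | 0.69/2 | 0.63/2 | 0.57/3 | 0.46/5 |
| PTM LP | 0.2/0.8 | No limit | 0.74/2 | 0.69/2 | 0.63/2 | 0.57/3 | 0.46/5 |

Ptot reduction with VDD,min = 0.7 V: HP 36~65%, LP 26~52%.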

Impacts of WID Variations: Global Clocking (GC)
- Global clocking limits the Fmax of a many-core processor to that of the slowest core.
- The previous $P_{dyn,N}$ equation can still be used to estimate $P_{dyn,N}$.
- The estimate of $P_{leak,N}$ has to account for each core's leakage variation as follows:

$P_{leak,N} = \sum_{i=1}^{N} l_i(V) \cdot (V/V_{DD}) \cdot P_{leak,base}$

- $l_i(V)$: leakage scaling factor of the $i$-th core; normalized to $I_{leak}(V_{DD})$

[Figures: a systematic Vth variation map for a 16-core processor; the corresponding normalized Fmax and Pleak map, plotted per core ID]
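The two global clocking rules above (chip frequency set by the slowest core, leakage summed per core) can be sketched in a few lines. This is an illustrative sketch with hypothetical function names and made-up per-core values, not data from the talk.

```python
def gc_fmax(core_fmax):
    """Under global clocking, the chip runs at the Fmax of its slowest active core."""
    return min(core_fmax)


def gc_normalized_pleak(leak_factors, v_ratio):
    """Pleak,N / Pleak,base = sum_i l_i(V) * (V/VDD).

    leak_factors : per-core leakage scaling factors l_i(V)
    v_ratio      : V/VDD
    """
    return sum(leak_factors) * v_ratio


if __name__ == "__main__":
    # Hypothetical 4-core sample: normalized Fmax and leakage factor per core.
    fmax = [0.92, 1.00, 1.07, 0.97]
    leak = [0.40, 0.50, 0.65, 0.45]
    print(gc_fmax(fmax), gc_normalized_pleak(leak, v_ratio=0.7))
```

When every core has the same factor l(V), the sum collapses to N * l(V) * (V/VDD), which matches the variation-free leakage equation.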

Impacts of WID Variations: Global Clocking (GC)
[Figures: HP, slowest base core; HP, fastest base core. Annotation: much more relative total power reduction, because the fastest base core is not power efficient.]

Average total power of 100 die samples / relative number of cores (N):

| Base | VDD,min | F=0.6 | F=0.7 | F=0.8 | F=0.9 | F=1.0 |
|---|---|---|---|---|---|---|
| Slow | 0.7 | 0.77/2 | 0.67/2 | 0.59/2 | 0.52/2 | 0.46/2 |
| Slow | 0.6 | 0.77/2 | 0.67/2 | 0.57/3 | 0.46/3 | 0.37/2 |
| Slow | No limit | 0.77/2 | 0.67/2 | 0.57/3 | 0.46/4 | 0.29/8 |
| Fast | 0.7 | 0.23/3 | 0.18/3 | 0.14/4 | 0.12/2 | 0.10/2 |
| Fast | 0.6 | 0.23/3 | 0.18/3 | 0.14/4 | 0.10/4 | 0.07/3 |
| Fast | No limit | 0.23/3 | 0.18/3 | 0.14/4 | 0.10/4 | 0.06/8 |

Total power reduction with VDD,min = 0.7 V: slow base 23~54%, fast base 77~90%.

Impacts of WID Variations: Frequency Island (FI) Clocking
- FI clocking is more performance and power efficient than GC because each core can run at its own fastest frequency.
- The previous GC $P_{leak,N}$ equation can be used to estimate $P_{leak,N}$.
- The equation for supply voltage scaling has to be modified as follows:

$M \cdot T_{cycle,base}(V_{DD}) = M \cdot \left( (1 - F)/f_j + F \Big/ \sum_{i=1}^{N} f_i \right) \cdot T_{cycle}(V)$

- The estimate of $P_{dyn,N}$ also has to account for an independent clock frequency per core:

$P_{dyn,N} = \left( (1 - F)\Big(f_j + \sum_{i=1,\, i \neq j}^{N} f_i K\Big) + F \sum_{i=1}^{N} f_i \right) \cdot (V/V_{DD})^2 \cdot P_{dyn,base}$

- Here $f_i$ is the frequency factor of the $i$-th core and $j$ indexes the core that runs the sequential portion.
- The fastest one among the chosen active cores always offers the optimal total power for processing the totally sequential portion of the workload.
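The FI equations can be checked numerically with a short sketch. It assumes that `f` holds one frequency factor per active core and that `j` is the index of the core chosen for the sequential portion; these names, and the sample values, are illustrative and not taken from the talk.

```python
def fi_time_factor(F, f, j):
    """(1 - F)/f_j + F / sum_i f_i: execution-time multiplier under FI clocking."""
    return (1.0 - F) / f[j] + F / sum(f)


def fi_normalized_pdyn(F, K, f, j, v_ratio):
    """Pdyn,N / Pdyn,base with an independent clock frequency per core.

    During the sequential portion core j is active and the others idle,
    each idle core contributing a fraction K of its dynamic power.
    """
    idle = sum(fi for i, fi in enumerate(f) if i != j)
    return ((1.0 - F) * (f[j] + K * idle) + F * sum(f)) * v_ratio ** 2


def fastest_core(f):
    """Index of the fastest active core, the choice the talk reports as optimal."""
    return max(range(len(f)), key=lambda i: f[i])


if __name__ == "__main__":
    f = [0.8, 1.0, 0.9]   # hypothetical per-core frequency factors
    j = fastest_core(f)
    print(j, fi_time_factor(0.5, f, j), fi_normalized_pdyn(0.5, 0.1, f, j, 0.7))
```

Running the sequential portion on the fastest core gives the smallest time multiplier, which leaves the most room to lower V at equal throughput. When all `f` entries are equal, the dynamic power expression reduces to the global clocking form k(F, K, N) * f(V) * (V/VDD)^2.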

Impacts of WID Variations: Frequency Island (FI) Clocking
- FI clocking is more power-efficient than global clocking (GC), which often wastes the Fmax of faster cores.
- On average, FI clocking offers 7% lower total power consumption than GC.
[Figures: HP, fastest base core; HP, slowest base core]

Average total power of 100 die samples / relative number of cores (N):

| Base | VDD,min | F=0.6 | F=0.7 | F=0.8 | F=0.9 | F=1.0 |
|---|---|---|---|---|---|---|
| Slow | 0.7 | 0.70/2 | 0.63/2 | 0.56/2 | 0.50/2 | 0.42/2 |
| Slow | 0.6 | 0.70/2 | 0.62/3 | 0.53/3 | 0.44/3 | 0.36/2 |
| Slow | No limit | 0.70/2 | 0.62/3 | 0.52/3 | 0.43/4 | 0.27/8 |
| Fast | 0.7 | 0.19/3 | 0.15/4 | 0.12/4 | 0.10/3 | 0.10/2 |
| Fast | 0.6 | 0.19/3 | 0.15/4 | 0.12/4 | 0.09/5 | 0.07/3 |
| Fast | No limit | 0.19/3 | 0.15/4 | 0.12/4 | 0.09/5 | 0.06/8 |

Total power reduction with VDD,min = 0.7 V: slow base 30~58%, fast base 81~90%.

Experimental Methodology
- HSPICE simulations with the 32nm PTM HP and LP models
- Frequency / leakage scaling factors:
  - VDD range: 0.55 ~ 1.05 V
  - 24 FO4 inverter chain for measuring f(VDD)
  - Complex gates for measuring l(VDD)
- Vth and Leff WID spatial and D2D variation map [3]:
  - Correlation distance coefficient: 0.5
  - WID variation (systematic Vth): 6.4%
  - D2D variation (Vth): 5.0%
[Figure: variation map; annotation: 1 grid point]
[3] Smruti R. Sarangi et al., "VARIUS: A Model of Process Variation and Resulting Timing Errors for Microarchitects," IEEE Transactions on Semiconductor Manufacturing (IEEE TSM), February 2008.

Conclusions
- Optimal number of active cores to minimize the total power consumption of many-core processors:
  - 2x more active cores at a lower voltage offer more than 50% total power reduction at the same throughput as a base core.
- Extended power analysis considering WID C2C frequency and leakage variations:
  - 2x more active cores at a lower voltage is the optimal choice.
  - FI clocking provides lower power consumption than GC since it can exploit C2C variations.
  - Using the fastest of the active cores for the sequential portion of the application led to the lowest power consumption.

Backup

Introduction: process variations
- Manufactured dies exhibit a large spread of transistor delay and leakage power across dies and within each die.
- Die-to-die (D2D) variations affect all transistors on a die equally.
- Within-die (WID) variations induce different characteristics across each die.
- As individual core size becomes smaller, core-to-core (C2C) frequency and leakage power variations due to spatially correlated WID variations will become considerable.
[Figures: die-to-die variations; spatial within-die variations. Source: Synopsys]

Supply Voltage and Power Scaling (2): supply voltage scaling of many-core processors
- Throughput with a certain number of cores at maximum VDD (thus Fmax) = throughput with more cores at a lower VDD (thus a lower Fmax).
- The potential throughput increase from many cores can be traded for a lower VDD to reduce power.
[Figure: a many-core processor with active and idle cores [1]; 1 active core operating at VDD compared with 8 active cores operating at a voltage V lower than VDD, at a lower operating frequency]
