
Efficient Realization of SIMD Defragmenter on Data-parallel Architectures
Explore the efficient implementation of SIMD Defragmenter, a solution for data-parallel architectures, presented by researchers from University of Michigan, Qualcomm Incorporated, and Intel Labs. The study covers topics like functional convergence, SIMD as an alternative to ASICs, and solutions for under-utilization in multimedia applications.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
SIMD Defragmenter: Efficient ILP Realization on Data-parallel Architectures Yongjun Park1, Sangwon Seo2, Hyunchul Park3, Hyoun Kyu Cho1, and Scott Mahlke1 March 6, 2012 1University of Michigan, Ann Arbor 2Qualcomm Incorporated, San Diego, CA 3Programmin Systems Lab, Intel Labs, Santa Clara, CA University of Michigan 1 Electrical Engineering and Computer Science
Convergence of Functionalities Flexible Accelerator! 4G Wireless Audio Video 3D Navigation Anatomy of an iPhone Convergence of functionalities demands a flexible solution Applications have different characteristics University of Michigan 2 Electrical Engineering and Computer Science
SIMD : Attractive Alternative to ASICs Suitable for running wireless and multimedia applications for future embedded systems 5.6x 500 450 Relative Area (Cost) Advantage 400 VLIW High throughput Low fetch-decode overhead Easy to scale 350 300 250 2x 200 SIMD 150 Disadvantage Hard to realize high resource utilization High SIMDization overhead 100 FUs 50 0 0 16 32 48 Issue width 64 80 96 112 128 Example SIMD architectures IBM Cell, ARM NEON, Intel MIC architecture, etc. Example SIMD machine: 100 MOps /mW 3 University of Michigan Electrical Engineering and Computer Science
Under-utilization on wide SIMD AAC 3D H.264 16-way SIMD Resource utilization Full Under Execution time distribution at different SIMD widths Multimedia applications have various natural SIMD widths SIMD width characterization of innermost loops (Intel Compiler rule) Inside & across applications How to use idle SIMD resources? University of Michigan 4 Electrical Engineering and Computer Science
Traditional Solutions for Under-utilization Dynamic power gating Selectively cut off unused SIMD lanes Effective dynamic & leakage power savings Transition time & power overhead High area overhead Thread 1 On Thread 2 Thread Level parallelism Execute multiple threads having separate data Different instruction flow Input-dependent control flow High memory pressure Thread 3 Off Thread 4 University of Michigan 5 Electrical Engineering and Computer Science
Objective of This Work Beyond loop-level SIMD Put idle SIMD lanes to work Find more SIMD opportunities inside vectorized basic blocks when loop-level SIMD parallelism is insufficient Possible SIMD instructions inside a vectorized basic block Perform same work Same data flow More than 50% of total instructions have some opportunities Challenges High data movement overhead between lanes Hard to find best instruction packing combination University of Michigan 6 Electrical Engineering and Computer Science
Partial SIMD Opportunity 1. Loop level SIMDization 2. Partial SIMDization SIMD Resource 1: For (it = 0; It < 4; it++) { 2: i = a[it] + b[it]; 3: j = c[it] + d[it]; 4: k = e[it] + f[it]; 5: l = g[it] + h[it]; 6: m = i + j; 7: n = k + l; 8: result[it] = m + n; 9: } 0 1 2 3 4 5 6 7 8 9 +12 +4 10 11 12 13 14 15 University of Michigan 7 Electrical Engineering and Computer Science
Subgraph Level Parallelism(SGLP) Data level parallelism between identical subgraphs SIMDizable operators Isomorphic dataflow No dependencies on each other Cost Advantages Minimize overhead No inter-lane data movement inside a subgraph Gain Cost Cost Gain Gain High instruction packing gain Multiple instructions inside a subgraph increase the packing gain University of Michigan 8 Electrical Engineering and Computer Science
Example: program order (2 degree) FFT kernel SIMD Lane LD0 LD1 0 2 1 3 Inter-lane move Inter-lane move Inter-lane move Inter-lane move Inter-lane move Inter-lane move Inter-lane move Inter-lane move ST1 LD1 ST3 11 1 3 5 7 9 0 1 5 7 4 6 ST0 LD0 ST2 10 0 2 4 6 8 9 1 1 8 10 Cycle ST1 ST0 ST3 ST2 Gain: 1 = 9 (SIMD) 8 (overhead) Lane 1 Lane 0 University of Michigan 9 Electrical Engineering and Computer Science
Example: SGLP (2 degree) FFT kernel SIMD Lane LD0 LD1 0 2 1 3 Inter-lane move Inter-lane move ST2 ST3 LD1 10 11 6 2 3 7 0 1 5 7 4 6 ST0 ST1 LD0 4 8 0 1 5 9 9 1 1 8 10 Cycle ST1 ST0 ST3 ST2 Gain: 7 = 9 (SIMD) 2 (overhead) Lane 1 Lane 0 University of Michigan 10 Electrical Engineering and Computer Science
Compilation Overview Hardware Information Application Loop-unrolling & Vectorization Dataflow Generation Loop-level Vectorized Basicblock Dataflow Graph 1. Subgraph Identification Identical Subgraphs 2. SIMD Lane Assignment Lane-assigned Subgraphs 3. Code Generation University of Michigan 11 Electrical Engineering and Computer Science
1. Subgraph Identification Heuristic discovery Grow subgraphs from seed nodes and find identical subgraphs 256 a b c d e f g h 2 1 Additional conditions over traditional subgraph search Corresponding operators are identical Operand type should be same: register/constant No inter-subgraph dependencies * + result University of Michigan 12 Electrical Engineering and Computer Science
2. SIMD Lane Assignment Select subgraphs to be packed and assign 2. Affinity cost Data movement between different subgraphs Use producer/consumer, common producer/consumer relation Affinity value: how many related operations exist between subgraphs 1. Subgraph gain Assign lanes based on decreasing order of gain them to SIMD lane groups Pack maximum number of instructions with minimum overhead Safe parallel execution without dependence violation Criteria: gain, affinity, partial order SIMD Lane 3. Partial order check Partial order of identical subgraphs inside the SIMD lanes must be same Gain: A > B > C > D Assign a lane with highest affinity cost Partial order: C0 C1 A1 A1 A0 A0 Affinity: B0 is closer to A0 than A1 B1 B1 C0 Conflict!! Lane 4~7 C0-1 C1-1 A1 C1 B1 B0 B0 C1 Lane 0~3 D D C0-0 C1-0 C0 A0 B0 D Time University of Michigan 13 Electrical Engineering and Computer Science
Experimental Setup 144 loops from industry-level optimized media applications AAC decoder (MPEG4 audio decoding, low complexity profile) H.264 decoder (MPEG4 video decoding, baseline profile, qcif) 3D (3D graphics rendering). Target architecture: wide vector machines SIMD width: 16 ~ 64 SODA-style wide vector instruction set Single-cycle delay data shuffle instruction( vperm(VMX), vex_perm(AltiVec)) IMPACT frontend compiler + cycle-accurate simulator Compared to 2 other solutions SLP: superword level parallelism (basic block level SIMDization) [Larsen, PLDI 00] ILP: Instruction level parallelism on same-way VLIW machines Apply 2 ~ 4 degree SGLP University of Michigan 14 Electrical Engineering and Computer Science
Static Performance SLP w/ overhead SLP w/o overhead SGLP w overhead SGLP w/o overhead ILP ILP 3 3 AAC H.264 3D AVG Relative Performance Relative Performance 2.5 2.5 2 2 1.5 1.5 1 1 2 2 3 3 4 4 2 2 3 3 4 4 2 2 3 3 4 4 2 2 3 3 4 4 # of degree # of degree SGLP retains similar trend as ILP after overhead consideration Max 1.66x @ 4-way (SLP 1.27x) See the paper for representative kernels (FFT, DCT, HafPel .) University of Michigan 15 Electrical Engineering and Computer Science
Dynamic Performance on SIMD SLP w/ overhead SGLP w/ overhead ILP 3 Relative Performance AAC H.264 AVG 3D 2.5 2 1.5 1 16 32 64 16 32 64 16 32 64 16 32 64 # of SIMD lanes Only when a natural SIMD width is insufficient, the available degree of SGLP is exploited. (Up to 4-way) Max 1.76x speedups (SLP: 1.29x) University of Michigan 16 Electrical Engineering and Computer Science
Energy@H.264 Execution 200 MHz(IBM 0.65nm technology) SIMD VLIW Control Control Control Control Control 8-wide SIMD 8-wide SIMD 8-wide SIMD 8-wide SIMD 8-wide SIMD 8-wide SIMD 8-wide SIMD 8-wide SIMD SGLP @ 32-wide SIMD 54.40 13.07 3.55 ILP @ 4 way 8-wide VLIW 93.17 10.77 5.02 gain -31.61% +21.36% -29.14% power (mW) cycle(million) energy (mJ) 30% energy efficient! University of Michigan 17 Electrical Engineering and Computer Science
Conclusion SIMD is an energy-efficient solution for mobile systems. SIMD programming of multimedia applications is an interesting challenge due to various degrees of SIMD parallelism. Subgraph level parallelism successfully provides supplemental SIMD parallelism by converting ILP into DLP inside the vectorized basic block. SGLP outperforms traditional loop-level SIMDization by up to 76% on a 64-way SIMD architecture. University of Michigan 18 Electrical Engineering and Computer Science
Questions? For more information http://cccp.eecs.umich.edu University of Michigan 19 Electrical Engineering and Computer Science
Example 2: High-level View SIMD Lane Kernel 1 Kernel 0 SIMD width: 8 A0 A0 A0 A1 A1 A1 Lane 4~7 Kernel 0 Kernel 2 Kernel 1 SIMD width: 4 B B B Lane 0~3 C0 C0 C0 C1 C1 C1 Kernel 2 SIMD width: 8 D D D Time Gain = (A1 + C1) (SIMD) ((A1->B) + (C1->D)) (overhead) University of Michigan 20 Electrical Engineering and Computer Science
Static Performance SLP w/o overhead SLP w/ overhead SGLP w/o overhead SGLP w overhead ILP ILP 4 4 Relative Performance Relative Performance 3.5 3.5 3 3 2.5 2.5 2 2 1.5 1.5 1 1 2 2 3 3 4 4 2 2 3 3 4 4 2 2 3 3 4 4 2 2 3 3 4 4 2 2 3 3 4 4 2 2 3 3 4 4 2 2 3 3 4 4 2 2 3 3 4 4 2 2 3 3 4 4 2 2 3 3 4 4 # of degree # of degree FFT MDCT MatMul4x4 MatMul3x3HalfPel QuarterPel Kernel AAC 3D H.264 AVG Application Performance results depend on kernel characteristics(Ex: MatMul4x4, MatMul3x3) SGLP retains similar trend as ILP after overhead consideration Max 1.66x @ 4-way (SLP 1.27x) University of Michigan 21 Electrical Engineering and Computer Science