
Accelerated FPGA-Pipelined Discrete-Event Simulation for HPC Systems
Explore an FPGA-pipelined approach to accelerated discrete-event simulation of high-performance computing systems. The study focuses on co-design using behavioral emulation and on fully-expanded and collapsed pipeline methodologies. Learn about constructing data flow graphs, mapping them to FPGA pipelines, and optimizing simulation configurations. The slides also cover algorithmic and architectural design-space exploration, balancing simulation speed and accuracy for rapid evaluation.
Presentation Transcript
A FPGA-Pipelined Approach for Accelerated Discrete-Event Simulation of HPC Systems
Carlo Pascoe, Sai P. Chenna, Greg Stitt, Herman Lam
PSAAPII Center for Compressible Multiphase Turbulence (CCMT)
NSF Center for High-Performance Reconfigurable Computing (CHREC)
Dept. of ECE, University of Florida, Gainesville FL 32608, USA
{pascoe, chenna, gstitt, hlam}@hcs.ufl.edu
Co-Design Using Behavioral Emulation (BE)
- HW/SW co-design: algorithmic & architectural design-space exploration (DSE)
- BE is coarse-grained simulation: balance of simulation speed & accuracy for rapid design-space evaluation
- Simulation platforms: BE-SST, FPGA acceleration
- BEO: Behavioral Emulation Object
CCMT | 2
Fully-Expanded Pipeline (FEP)
1. Construct Data Flow Graph (DFG) from simulation configuration
   - AppBEO + ArchBEO define instructions and operand/output dependencies
   - Instructions map to vertices and dependencies map to edges in the DFG
   - Various opportunities for graph-level optimizations
   [Figure: extracting the DFG from a BE simulation configuration. The AppBEOs (if(id==0) send; else recv; mm; if(id==0) recv; else send;) and the ArchBEOs (proc, comm) are mapped to a DFG of send, recv, and mm vertices.]
2. Map DFG to pipeline circuit
   - Vertex attributes define operations and instantiate dedicated HW
   - Edge attributes (e.g., src/dst) instantiate a pipeline register between each src/dst pair
   - Various opportunities for circuit-level optimizations
   [Figure: mapping the DFG to an FPGA pipeline (pipelined simulation). Example event attributes: MM function with Type: comp, Subtype: mm, TID: 0, EID: 1, Parameters: [256,256,256]; RECV function with Type: comm, Subtype: recv, TID: 1, EID: 0, STID: 0, SEID: 0, msg size. Signal labels: prev.s, prev.s1, mm_out.s, recv_out.s.]
Because each instruction (from the simulation) is mapped to independent HW (no resource sharing), each vertex is able to start the next simulation 1 cycle after the current simulation.
CCMT | 3
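The DFG construction step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tool: the names `PROGRAMS`, `build_dfg`, and `pipeline_stages` are hypothetical, the example is fixed to the two-rank send/mm/recv configuration shown in the figure, and the k-th recv on a rank is assumed to pair with the k-th send on its peer. Vertex keys follow the (TID, EID) event attributes from the slide, and recv vertices carry STID/SEID for their source event.

```python
# Hypothetical per-rank instruction streams for the slide's example:
# if(id==0) send; else recv;  mm;  if(id==0) recv; else send;
PROGRAMS = {0: ["send", "mm", "recv"], 1: ["recv", "mm", "send"]}


def build_dfg(programs):
    """Return (vertices, edges); vertices are keyed by (tid, eid)."""
    vertices, edges = {}, []
    for tid, ops in programs.items():
        for eid, op in enumerate(ops):
            kind = "comp" if op == "mm" else "comm"
            vertices[(tid, eid)] = {"type": kind, "subtype": op}
            if eid > 0:
                # Program-order dependency on the previous event of this rank.
                edges.append(((tid, eid - 1), (tid, eid)))
    for tid, ops in programs.items():
        peer = 1 - tid  # assumption: exactly two ranks
        sends = [e for e, op in enumerate(programs[peer]) if op == "send"]
        recvs = [e for e, op in enumerate(ops) if op == "recv"]
        for seid, eid in zip(sends, recvs):
            # Message dependency: the recv waits on the matching send.
            vertices[(tid, eid)].update(stid=peer, seid=seid)
            edges.append(((peer, seid), (tid, eid)))
    return vertices, edges


def pipeline_stages(vertices, edges):
    """Longest-path level of each vertex, i.e. the earliest stage it can occupy."""
    preds = {v: [] for v in vertices}
    for src, dst in edges:
        preds[dst].append(src)
    stage = {}

    def visit(v):
        if v not in stage:
            stage[v] = 1 + max((visit(p) for p in preds[v]), default=-1)
        return stage[v]

    for v in vertices:
        visit(v)
    return stage


vertices, edges = build_dfg(PROGRAMS)
stage = pipeline_stages(vertices, edges)
for key in sorted(vertices):
    print(key, vertices[key], "stage", stage[key])
```

In a fully-expanded mapping each vertex would get its own hardware unit and each edge a pipeline register; the stage numbers here only order the vertices and do not model per-operation latency.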
Collapsed Pipeline (CP)
1. Construct Data Flow Graph (DFG) from simulation configuration
   - AppBEO + ArchBEO define instructions and operand/output dependencies
   - Instructions map to vertices and dependencies map to edges in the DFG
   - Partition into linear subgraphs and generate dependency lists
   [Figure: extracting the DFG, then identifying subgraphs & dependencies. Same configuration as the fully-expanded example (if(id==0) send; else recv; mm; if(id==0) recv; else send;); dependency list entries (1,1) (2,1) (2,3) (1,3).]
2. Map DFG to pipeline circuit
   - Vertex attributes define operations and edge attributes instantiate a pipeline register between src/dst pairs
   - Align subgraph traces such that cost is minimized and no dependencies are violated
   [Figure: Align, then Map. Subgraph traces of send, recv, and mm are padded with delay slots and merged through muxes.]
Because each subgraph instruction is mapped to independent HW, each vertex is able to start the next subgraph 1 cycle after the current subgraph. All subgraphs must complete before the simulation can complete.
CCMT | 4
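The alignment step can be illustrated with a small Python sketch. It takes the two linear subgraphs and the dependency pairs from the figure, using the slide's 1-indexed (subgraph, position) notation, and pads each trace with delay slots so that every consumer sits after its producer. The names `SUBGRAPHS`, `DEPS`, and `align` are hypothetical, and the scheduling rule is a plain as-soon-as-possible placement: it respects the dependencies but does not perform the cost minimization the slide describes, whose details are not given here.

```python
# Linear subgraphs (one per rank) and cross-subgraph dependencies,
# written as (producer, consumer) pairs of (subgraph, position), 1-indexed.
SUBGRAPHS = {1: ["send", "mm", "recv"], 2: ["recv", "mm", "send"]}
DEPS = [((1, 1), (2, 1)), ((2, 3), (1, 3))]


def align(subgraphs, deps):
    """Assign each instruction a slot and return delay-padded traces."""
    slot = {(s, i + 1): i for s, ops in subgraphs.items() for i in range(len(ops))}
    changed = True
    while changed:  # relax until no constraint is violated (deps must be acyclic)
        changed = False
        for s, ops in subgraphs.items():
            for i in range(2, len(ops) + 1):
                if slot[(s, i)] <= slot[(s, i - 1)]:
                    slot[(s, i)] = slot[(s, i - 1)] + 1
                    changed = True
        for src, dst in deps:
            if slot[dst] <= slot[src]:
                slot[dst] = slot[src] + 1
                changed = True
    length = max(slot.values()) + 1
    traces = {}
    for s, ops in subgraphs.items():
        trace = ["delay"] * length
        for i, op in enumerate(ops, start=1):
            trace[slot[(s, i)]] = op
        traces[s] = trace
    return traces


traces = align(SUBGRAPHS, DEPS)
for s, trace in traces.items():
    print(s, " ".join(trace))
```

A cost-aware aligner would additionally try to place like operations of different subgraphs in the same slot so that they can share one hardware unit behind a mux.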
Fully-Expanded & Collapsed Pipeline Tradeoffs

Entries with two values are given as fully-expanded / collapsed.

Ranks  TS  Num. of Events  % LU     Latency (cycles)  Hardware MSPS  Hardware GEPS  BE-SST KEPS  Hardware Speedup
32     1   1,344           15 / 2   64 / 278          300 / 10.5     403 / 14.1     4.4          92x10^6 / 3x10^6
32     2   2,688           31 / 3   118 / 512         300 / 10.5     806 / 28.2     7.8          103x10^6 / 4x10^6
32     3   4,032           46 / 4   172 / 746         300 / 10.5     1,210 / 42.3   10.6         114x10^6 / 4x10^6
32     4   5,376           61 / 6   226 / 980         300 / 10.5     1,613 / 56.4   12.9         125x10^6 / 4x10^6
32     5   6,720           76 / 7   280 / 1,214       300 / 10.5     2,016 / 70.6   14.8         136x10^6 / 5x10^6
32     6   8,064           92 / 9   334 / 1,488       300 / 10.5     2,419 / 84.7   16.5         147x10^6 / 5x10^6
32     8   10,752          - / 12   - / 1,916         - / 10.5       - / 113        19.1         - / 6x10^6
32     16  21,504          - / 24   - / 3,788         - / 10.5       - / 226        24.2         - / 9x10^6
32     32  43,008          - / 44   - / 7,532         - / 10.5       - / 452        28.9         - / 16x10^6
64     1   2,880           32 / 2   65 / 394          300 / 5.23     864 / 15.1     7.7          112x10^6 / 2x10^6
64     2   5,760           65 / 4   119 / 712         300 / 5.23     1,728 / 30.1   12.6         137x10^6 / 2x10^6
64     3   8,640           99 / 5   173 / 1,030       300 / 5.23     2,592 / 45.2   16.2         160x10^6 / 3x10^6
64     4   11,520          - / 7    - / 1,348         - / 5.23       - / 60.2       17.9         - / 3x10^6
64     8   23,040          - / 14   - / 2,620         - / 5.23       - / 121        23.2         - / 5x10^6
64     16  46,080          - / 29   - / 5,164         - / 5.23       - / 241        26.9         - / 9x10^6
64     32  92,160          - / 46   - / 10,252        - / 5.23       - / 482        29.1         - / 17x10^6
128    1   5,952           66 / 2   66 / 458          300 / 2.62     1,786 / 15.6   10.8         165x10^6 / 1x10^6
128    2   11,904          - / 4    - / 776           - / 2.62       - / 31.2       14.9         - / 2x10^6
128    3   17,856          - / 5    - / 1,094         - / 2.62       - / 46.8       17.1         - / 3x10^6
128    4   23,808          - / 7    - / 1,412         - / 2.62       - / 62.4       18.4         - / 3x10^6
128    8   47,616          - / 15   - / 2,684         - / 2.62       - / 125        20.8         - / 6x10^6
128    16  95,232          - / 30   - / 5,228         - / 2.62       - / 250        22.3         - / 11x10^6
128    32  190,464         - / 47   - / 10,316        - / 2.62       - / 499        22.8         - / 22x10^6

MSPS: Mega-Simulations-Per-Second. GEPS/KEPS: Giga/Kilo-Events-Per-Second. "-" indicates configuration unable to fit on a single FPGA.

Fully-Expanded Pipeline
Advantages:
- Superior performance in terms of simulation throughput and latency
- 280 - 320 MHz implies 280 - 320 million simulations per second, independent of simulated MPI ranks
Limitations:
- Resources scale linearly with both MPI ranks and number of timesteps
- Scaling across multiple FPGAs expected to be ineffective when considering exascale simulations
CCMT | 5
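The throughput figures on this slide are tied together by simple arithmetic, which the following Python sketch reproduces for the 32-rank, 1-timestep configuration. The function names `geps` and `speedup` are hypothetical; the relations assumed are that event throughput equals events per simulation times simulation throughput, and that hardware speedup is the ratio of hardware event throughput to BE-SST event throughput. Both relations agree with the tabulated values to rounding.

```python
def geps(events_per_sim, msps):
    """Giga-events/s from events per simulation and mega-simulations/s."""
    return events_per_sim * msps / 1e3


def speedup(hw_geps, sw_keps):
    """Ratio of hardware event throughput (GEPS) to BE-SST throughput (KEPS)."""
    return (hw_geps * 1e9) / (sw_keps * 1e3)


# 32 ranks, 1 timestep: 1,344 events, 300 / 10.5 MSPS, BE-SST at 4.4 KEPS.
fep_geps = geps(1_344, 300)    # fully-expanded pipeline
cp_geps = geps(1_344, 10.5)    # collapsed pipeline
print(f"FEP: {fep_geps:.1f} GEPS, speedup {speedup(fep_geps, 4.4):.2e}")
print(f"CP:  {cp_geps:.1f} GEPS, speedup {speedup(cp_geps, 4.4):.2e}")
```

The fully-expanded pipeline accepts one new simulation per clock cycle, which is why its simulation throughput in MSPS tracks the clock frequency in MHz regardless of rank count.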
Fully-Expanded & Collapsed Pipeline Tradeoffs (table repeated from slide 5)

Collapsed Pipeline
Advantages:
- Resources scale linearly with timesteps, but sublinearly with MPI ranks
- Allows for significantly more timesteps due to its much lower base utilization
- Better scaling on single and multiple FPGAs
Limitations:
- Lower simulation throughput and longer initial latency, but still more than sufficient for rapid design-space exploration
CCMT | 6
Collapsed Pipeline Single-FPGA Performance/Scalability

Ranks  TS  Num. of Events  % LU  Latency (cycles)  Hardware MSPS  Hardware GEPS  BE-SST KEPS  Hardware Speedup
32     32  43,008          44    7,532             10.5           450            10.14        44x10^6
64     32  92,160          46    10,252            5.23           482            24.91        19x10^6
128    32  190,464         47    10,316            2.62           498            24.42        20x10^6
256    32  393,216         67    13,516            1.31           515            23.92        21x10^6
512    32  811,008         84    20,684            0.65           531            21.47        25x10^6
1K     32  1,646,592       84    21,196            0.33           539            18.21        29x10^6
32K    32  55,443,456      84    241,868           1.02x10^-2     567            12.96        44x10^6
128K   16  111,673,344     46    333,932           2.56x10^-3     285            8.91         32x10^6
1M     2   112,459,776     5     1,148,056         3.19x10^-4     36             X            X

As the number of ranks increases:
- additional logic per rank approaches zero
- simulation throughput is reduced by a factor of ranks
- event throughput remains proportional to instantiated event hardware
Now able to reach million+ rank simulations in hardware
- Rank limit due to insufficient blockram, not logic
Pipelines scale linearly with length of simulation; however, blockram eventually becomes the limiting resource
- e.g., only 2 TS fit for 1 million ranks on Stratix V
- Motivation to explore partially-collapsed pipeline approach
Increased performance expected when Stratix 10 becomes available
- Collapsed hardware specifically designed to exploit new Stratix 10 architecture
Throughput > 10Mx higher than BE-SST

*Collapsed performance of CMT-Bone-BE with varied MPI ranks and simulation timesteps on a single Stratix V S5GSMD8K1F40C2 @ 335MHz
CCMT | 7
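The statement that simulation throughput is reduced by a factor of ranks can be checked against the table with a short Python sketch. It assumes the collapsed pipeline issues one rank's subgraph per clock cycle, so that simulations per second equal the 335 MHz clock divided by the rank count, and it reads 1K, 32K, 128K, and 1M as powers of two. The function names `cp_msps` and `cp_geps` are hypothetical; the model is an inference from the slide, not a stated formula, but it reproduces the Hardware MSPS and Hardware GEPS columns to rounding.

```python
CLOCK_MHZ = 335.0  # clock frequency quoted in the slide footnote


def cp_msps(ranks, clock_mhz=CLOCK_MHZ):
    """Mega-simulations/s, assuming one rank subgraph is issued per cycle."""
    return clock_mhz / ranks


def cp_geps(events_per_sim, ranks, clock_mhz=CLOCK_MHZ):
    """Giga-events/s: events per simulation times simulations per second."""
    return events_per_sim * cp_msps(ranks, clock_mhz) / 1e3


# (ranks, events per simulation) taken from three rows of the table.
rows = [(32, 43_008), (1024, 1_646_592), (2**20, 112_459_776)]
for ranks, events in rows:
    print(f"{ranks:>8} ranks: {cp_msps(ranks):.3g} MSPS, {cp_geps(events, ranks):.0f} GEPS")
```

Under this model event throughput depends only on events per rank and the clock, which matches the slide's point that it stays proportional to the instantiated event hardware; the drop at 128K and 1M ranks follows from the smaller number of timesteps that fit.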