Low-Energy High-Performance Computing on FPGAs


This presentation describes how High-Level Synthesis (HLS) on Field-Programmable Gate Arrays (FPGAs) can deliver high-performance computing at reduced energy consumption. The approach aims at hardware efficiency with software-like engineering cost: recent HLS advances shorten hardware design time, enable design space exploration, and allow software workloads to be implemented on FPGAs for energy efficiency. Multi-language synthesis (C, C++, SystemC, and others) speeds up development without requiring a new language. Integration of Simulink models, model-based design, and high-level synthesis for efficient hardware implementation are key aspects of the project.

  • Low-Energy Computing
  • High-Performance
  • FPGAs
  • High-Level Synthesis
  • Multi-Language Synthesis




Presentation Transcript


  1. Low-Energy High-Performance Computing via High-Level Synthesis on FPGAs Luciano Lavagno luciano.lavagno@polito.it

  2. Objectives and approach
  • Provide HW efficiency with SW-like non-recurring engineering cost
  • Exploit recent advances in High-Level Synthesis to enable a compilation flow for HW
  • Reduce HW design time (in particular verification time) by using High-Level Synthesis from a variety of concurrent models
  • Improve Quality of Results by means of manual and automated Design Space Exploration
  • Reduce energy consumption while retaining re-programmability, by implementing SW on FPGA

  3. Multi-language synthesis
  • No single winner in the domain of specification languages for HLS (synthesizing a high-level model into RTL): C, C++, SystemC, Simulink/Stateflow, and CUDA/OpenCL have all been proposed, and all have been successful to some extent
  • Avoid the need to learn a new language
  • Speed up development by enabling verification in a domain-specific language
  • C++/SystemC can act as a common intermediate language between domain-specific modeling and HLS (see the sketch below)
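To make the "common intermediate language" point concrete, here is a minimal, hypothetical sketch of plain C serving as HLS input. The function and its pragma usage are illustrative, not taken from the project; the pragma syntax follows Xilinx Vivado HLS conventions.

```c
/* saxpy.c -- hypothetical example: plain C as HLS input.
 * The same source compiles natively for SW verification and
 * synthesizes to RTL with an HLS tool (pragma syntax follows
 * Xilinx Vivado HLS conventions). */
void saxpy(const float *x, const float *y, float *out,
           float a, int n)
{
    for (int i = 0; i < n; i++) {
#pragma HLS PIPELINE II=1   /* aim for one loop iteration per clock cycle */
        out[i] = a * x[i] + y[i];
    }
}
```

The same function body is exercised by the SW testbench and by the synthesized RTL, which is where the verification-time savings come from.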

  4. HLS from Simulink models
  • Simulink/Stateflow is an industry-standard model-based design tool for:
    • Algorithmic modeling and simulation
    • Code generation for both SW and HW
  • HW implementation generation is limited to:
    • One or a few architectures (cost/performance points)
    • One or a few platforms (Xilinx- or Altera-specific tools)
  • Our approach:
    • Exploit well-established SW code generation in C
    • Customize code generation for efficient HW implementation
    • Perform automated Design Space Exploration, without designer input
    • For broadly used blocks (e.g. FFT), write HLS-specific models for HW DSE

  5. Model-based design from Simulink (design flow)
  • Modeling, simulation, and verification in Simulink
  • Model translation to C (ERT Coder)
  • Wrapping of the C model into SystemC wrapper classes
  • Code profiling and partitioning decisions
  • Outputs: embedded C code generation, SystemC wrapping code for the embedded processor, and High-Level Synthesis for HW (see the sketch below)
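A hedged sketch of what this flow manipulates: Embedded Coder (ERT) conventionally emits initialize/step entry points for a model; the names below (dct_initialize, dct_step, the input/output structs) are hypothetical placeholders for generated code, not the project's actual interfaces.

```c
/* Hypothetical shape of ERT-generated C code for a model "dct".
 * The SystemC wrapper classes in the flow above call these C
 * entry points from a clocked process, so the same generated
 * code serves simulation, SW compilation, and HLS. */
typedef struct { float in[64];  } dct_U;   /* model inputs  */
typedef struct { float out[64]; } dct_Y;   /* model outputs */

extern dct_U dct_in;    /* written by the wrapper before each step */
extern dct_Y dct_out;   /* read by the wrapper after each step     */

void dct_initialize(void);  /* reset internal model state   */
void dct_step(void);        /* one fixed-step model update  */
```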

  6. Automated Design Space Exploration (tool flow)
  • Partition & wrap the model, together with a testbench
  • HLS generates multiple RTL variants (RTL-1, RTL-2, RTL-3, … RTL-n)
  • Logic synthesis produces the corresponding gate-level netlists (Gate-1 … Gate-4)
  • Simulation collects a switching activity database for gate-level power estimation
  • Per-variant results: area reports (library cells), power reports, and throughput estimates, stored in a database (Excel spreadsheet)

  7. Automated DSE results

  8. HLS from OpenCL to FPGAs
  • A significant portion of the cost of data centers is due to energy consumption (both compute energy and cooling costs)
  • A large number of data-center algorithms (e.g. search, image recognition, speech recognition) are embarrassingly parallel
  • Efficient code written in parallel languages is already available
  • FPGA implementation provides an attractive alternative to fully-programmable implementation on CPUs or GP-GPUs:
    • Low energy
    • Reconfiguration enables use for different applications
    • Good performance

  9. ECOSCALE project goals
  • Scalability: 100 million computing units, billions of tasks
    • Improve performance 1000X wrt the state of the art
    • Sharing of acceleration and storage resources without global cache coherence
  • High energy efficiency: max 0.5 W/unit
    • Improve energy efficiency by 100X
    • FPGA use for high-performance computing
  • Reliability: 1000-hour MTBF
    • Improve reliability by 1000X
    • Dynamic FPGA reconfiguration

  10. Efficiency and programmability
  • HW implementation on FPGA: very high energy efficiency, while keeping dynamic reconfigurability
  • OpenCL programming: extreme parallelism with a simple programming model
  • Dynamic resource allocation: runtime FPGA reconfiguration
  • Efficient memory access: shared global memory among all CPUs and FPGAs in a cluster, without global cache coherency

  11. OpenCL programming model
  • Every kernel (functional computation unit mapped to a CPU, GPU, or FPGA) is divided into:
    • Completely independent workgroups, which can be assigned at runtime to different computation resources for the best resource/performance trade-off
    • Cooperating synchronized workitems, which share local memory (SRAM)
  • The memory hierarchy explicitly distinguishes between:
    • Global DRAM (shared among kernels and with host code)
    • Local SRAM (shared among workitems)
    • Private registers
  • The programmer has already solved the most difficult problems: parallelization and efficient exploitation of the memory hierarchy (see the kernel sketch below)
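A minimal OpenCL C kernel (hypothetical, for illustration only) that touches each level of the hierarchy just described: __global DRAM shared with the host, __local SRAM shared within a workgroup, and private registers per workitem.

```c
/* Hypothetical OpenCL C kernel: per-tile prefix sum.
 * Each workgroup stages a tile of global DRAM into local SRAM;
 * each workitem then accumulates in a private register. */
__kernel void tile_sum(__global const float *in,
                       __global float *partial,
                       __local float *tile)   /* SRAM, one tile per workgroup */
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    tile[lid] = in[gid];              /* global DRAM -> local SRAM */
    barrier(CLK_LOCAL_MEM_FENCE);     /* synchronize the workgroup's workitems */

    float acc = 0.0f;                 /* private register */
    for (size_t i = 0; i <= lid; i++)
        acc += tile[i];               /* prefix sum within the tile */

    partial[gid] = acc;               /* result back to global DRAM */
}
```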

  12. FPGA implementation
  • Both Xilinx and Altera support OpenCL as a functional specification language
  • Workgroups can be replicated arbitrarily, to improve performance at the cost of resource usage
  • Workitems are pipelined for efficient HW implementation
  • Loops within a workitem and local memory must be implemented efficiently: automated Design Space Exploration
  • Xilinx SDAccel allows one to use OpenCL code almost out of the box, integrating functional debugging and cost/performance analysis
  • Design Space Exploration must be automated, since it requires HW design expertise (an annotated kernel sketch follows)
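A hedged sketch of the FPGA-oriented annotations this implies. The kernel itself is hypothetical; `reqd_work_group_size` is a standard OpenCL attribute that lets the tool fix the pipeline depth, and `#pragma unroll` is the loop-unrolling hint accepted by the vendors' OpenCL compilers. The unroll factor is exactly the kind of knob automated DSE would sweep.

```c
/* Hypothetical 8-tap FIR kernel annotated for FPGA HLS.
 * A fixed workgroup size lets the tool size the workitem pipeline;
 * unrolling the inner loop trades FPGA resources for throughput.
 * (Boundary handling at the end of x is omitted in this sketch.) */
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void fir(__global const float *x,
         __global float *y,
         __constant float *coeff)   /* 8 filter coefficients */
{
    size_t gid = get_global_id(0);
    float acc = 0.0f;

    #pragma unroll                  /* fully unroll: 8 MACs per cycle */
    for (int k = 0; k < 8; k++)
        acc += coeff[k] * x[gid + k];

    y[gid] = acc;
}
```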

  13. Application examples
  • Financial algorithms: e.g. Black-Scholes and Heston
    • Monte-Carlo parallel simulations: no local or global memory use (see the sketch below)
    • FPGA is much more efficient than GPU, in terms of both performance and energy-per-computation
  • Machine learning: e.g. K-nearest-neighbors
    • Limited by global memory bandwidth (GPU is typically better)
    • FPGA consumes much less energy, and can exploit streaming communication
  • Sorting: e.g. bitonic sorting
    • Limited by global memory bandwidth (GPU is typically better)
    • FPGA consumes much less energy, and can exploit streaming communication
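To make the Monte-Carlo point concrete, a hypothetical sketch of a path-simulation kernel: all state lives in private registers, and only the final payoff touches global memory, which is why such kernels pipeline so well on an FPGA. The Euler update and the toy LCG random-number generator are illustrative stand-ins, not the actual Heston implementation.

```c
/* Hypothetical Monte-Carlo kernel (geometric-Brownian-motion style).
 * No __local memory and a single final __global write: the whole
 * path simulation lives in private registers, so the FPGA can
 * pipeline iterations back-to-back with no memory bottleneck. */
__kernel void mc_path(__global float *payoff,
                      float s0, float drift, float vol,
                      int steps, uint seed)
{
    size_t gid = get_global_id(0);
    uint rng = seed ^ (uint)gid;      /* per-workitem RNG state (toy LCG) */
    float s = s0;                     /* private register only */

    for (int t = 0; t < steps; t++) {
        rng = rng * 1664525u + 1013904223u;
        float z = ((float)(rng >> 8) / 16777216.0f) - 0.5f;  /* ~[-0.5, 0.5) */
        s *= 1.0f + drift + vol * z;  /* one Euler step */
    }
    payoff[gid] = s;                  /* the only global memory access */
}
```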

  14. Example: Heston model of financial markets

  Platform   t (ns)   Power (W)   Energy/step (nJ)
  GTX 960    0.604    120         72
  K4200      0.663    105         70
  Virtex 7   1.424    12          17

  No global memory use ensures competitive performance (energy/step = t × power)

  15. Example: K-nearest-neighbors

  Platform   Time      Power (W)   Energy (J)
  GTX 960    930 ms    120         111.6
  K4200      3110 ms   108         335.88
  Virtex 7   1 ms      3           0.0039

  On-chip global memory ensures very high performance via streaming

  16. Summary
  • OpenCL and FPGAs provide an almost-ideal platform for highly-parallel data-center software
  • Excellent energy-per-computation savings, with good performance
  • Some FPGA-specific high-level optimization is required, e.g. to exploit global memory access bursts (see the sketch after this list)
  • Several examples from different application domains provide encouraging results
  • Design space exploration is much easier than with other (less embarrassingly parallel) models
  • Dynamic resource management is key to data-center and high-performance-computing applicability
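One concrete instance of the burst optimization mentioned above, as a hypothetical sketch: `async_work_group_copy` is a standard OpenCL built-in that FPGA toolchains can map to a single wide DRAM burst instead of many narrow per-workitem reads. The kernel and its names are illustrative.

```c
/* Hypothetical sketch: staging data with a burst transfer.
 * async_work_group_copy moves a contiguous block from global
 * DRAM into local SRAM; FPGA OpenCL compilers can implement it
 * as one wide burst rather than scattered narrow accesses. */
__kernel void burst_stage(__global const float *in,
                          __global float *out,
                          __local float *buf)
{
    size_t lid  = get_local_id(0);
    size_t base = get_group_id(0) * get_local_size(0);

    event_t e = async_work_group_copy(buf, in + base,
                                      get_local_size(0), 0);
    wait_group_events(1, &e);           /* burst complete */

    out[base + lid] = 2.0f * buf[lid];  /* placeholder computation */
}
```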
