A Modular Heterogeneous Stack for Deploying FPGAs and CPUs in the Data Center
This presentation discusses the deployment of FPGA and CPU clusters in data centers, with a focus on communication architecture models. It covers FPGA deployments by Microsoft and Amazon, the communication between accelerators and CPUs in a cluster, and how the evolution of reconfigurable computing affects data center performance.
Presentation Transcript
A Modular Heterogeneous Stack for Deploying FPGAs and CPUs in the Data Center Nariman Eskandari, Naif Tarafdar, Daniel Ly-Ma, Paul Chow High-Performance Reconfigurable Computing Group University of Toronto February 28, 2025
FPGAs in Clouds and Data Centers? Microsoft: Catapult v1 (2014), which delivered roughly 95% more throughput for about 10% more power; Catapult v2 (2017); Brainwave (2017). Amazon: AWS F1 (2017).
Background: Heterogeneous Communication. Two architecture models for FPGA and CPU clusters: the slave model, in which FPGAs are reached through the CPUs they are attached to, and the peer model, in which CPUs and FPGAs connect to the network directly as equals. With the peer model there is no separate mechanism for communication between accelerators and CPUs, which is easier and allows direct connections between accelerators. [Diagram: CPUs and FPGAs on the network in the slave and peer models.]
Background: System Orchestration. The user issues a request to the heterogeneous cloud provider; the provider connects the requested resources (FPGAs and CPUs) on the network and returns a network handle for the cluster to the user. [Diagram: user, heterogeneous cloud provider, and FPGAs and CPUs attached to the network.]
Contributions. Galapagos: the work in [FPGA 2017], previously a large monolithic layer, was rearchitected to focus on modularity, which lets users experiment with the design space for heterogeneous clusters; scalability issues in [FPGA 2017] were also addressed. HUMboldt (Heterogeneous Uniform Messaging): a communication layer that is heterogeneous (multi-FPGA and CPU), uses the same high-level code for software and hardware (portable), and is easy to use for building scalable applications on a CPU/FPGA cluster. [1] Naif Tarafdar et al., Enabling Flexible Network FPGA Clusters in a Heterogeneous Cloud Data Center, FPGA 2017.
Outline: Galapagos, HUMboldt, Results, Conclusion, Future Work.
Heterogeneous Abstraction Stack. In [FPGA 2017] the stack was a single monolithic orchestration layer; in the new stack that layer is split, with the HUMboldt communication layer sitting on top of the Galapagos middleware layer.
Galapagos: Middleware Layer. The user defines an FPGA cluster using cluster description files and AXI-Stream kernels; the tool flow then maps the kernels onto VMs and FPGAs connected over the network. [Diagram: a cluster description file and kernels passing through the tool flow onto VMs and FPGAs 1-3, linked by AXI-Stream connections.] A sketch of such a kernel follows.
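To make the kernel interface concrete, here is a minimal sketch of an AXI-Stream kernel of the kind the Galapagos tool flow connects, written as Vivado HLS C++. The port names, the 512-bit width, the use of the tdest side channel for kernel addressing, and the destination ID are illustrative assumptions, not the project's documented API.

    // Minimal sketch of an AXI-Stream kernel (assumptions noted above).
    #include "ap_axi_sdata.h"
    #include "hls_stream.h"

    typedef ap_axiu<512, 1, 1, 16> axis_word;  // data word plus a tdest routing side channel

    void echo_kernel(hls::stream<axis_word> &in, hls::stream<axis_word> &out) {
    #pragma HLS INTERFACE axis port=in
    #pragma HLS INTERFACE axis port=out
    #pragma HLS INTERFACE ap_ctrl_none port=return
        axis_word w = in.read();   // receive one flit from the on-chip router
        w.dest = 2;                // address it to another kernel (hypothetical kernel ID 2)
        out.write(w);              // the router delivers it on-chip or over the network
    }

The idea is that a kernel exposing only AXI-Stream ports can be placed on any FPGA named in the cluster description without changing its source.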
Galapagos Hypervisor. On the FPGA, the Hypervisor abstracts all of the I/O interfaces around the Application Region: the PCIe link to the CPU (with its driver), 2x DDR, and the network. Limitations of the base infrastructure were that it was closed source (built from proprietary IPs) and not easily portable to other boards, which hurts heterogeneity. It was therefore redesigned using publicly available Xilinx IPs and now supports higher levels of the network stack: the IP layer and the transport layer (e.g., TCP).
Galapagos: Hypervisor. [Diagram comparing the [FPGA 2017] hypervisor and application region with the Galapagos version.]
Galapagos: Application Region. [Diagrams contrasting the [FPGA 2017] application region with the Galapagos one, highlighting in turn the Router, the Network Bridge, and the Comm(unication) Bridge.]
Outline: Base Infrastructure, Galapagos, HUMboldt, Results, Conclusion, Future Work.
HUMboldt (Heterogeneous Uniform Messaging) Communication Layer. A message-passing communication layer implementing a minimal subset of MPI: only blocking sends and receives. It is both a software and a hardware library, with exactly the same source code used for hardware and software (functional portability).
HUMboldt Hardware. All of the communication functions are implemented as High-Level Synthesis (HLS) functions and provided as a library that users integrate into their own HLS code, which gives functional portability and ease of use; the underlying protocol is handled by Galapagos. A sketch of what such a function could look like follows.
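As a rough illustration of what "implemented as HLS functions" can look like, this sketch builds a blocking send on a Galapagos-style AXI-Stream output. The function name hum_send, the header-flit layout, and the use of tdest for the destination rank are assumptions made for illustration only, not the published HUMboldt API.

    // Sketch of a blocking, HUMboldt-style send in HLS (hypothetical API; see note above).
    #include "ap_axi_sdata.h"
    #include "ap_int.h"
    #include "hls_stream.h"

    typedef ap_axiu<512, 1, 1, 16> axis_word;

    void hum_send(hls::stream<axis_word> &out, const ap_uint<512> *buf,
                  int num_words, int dest_rank) {
        axis_word hdr;
        hdr.data = num_words;   // header flit announcing the payload length
        hdr.keep = -1;
        hdr.dest = dest_rank;   // the router uses tdest to deliver the message
        hdr.last = 0;
        out.write(hdr);
        for (int i = 0; i < num_words; i++) {
    #pragma HLS PIPELINE II=1
            axis_word w;
            w.data = buf[i];
            w.keep = -1;
            w.dest = dest_rank;
            w.last = (i == num_words - 1);
            out.write(w);       // stalls when the stream backs up, giving blocking semantics
        }
    }

The matching blocking receive would read flits until it sees last set; everything below the stream interface (Ethernet or TCP framing) is handled by Galapagos, as stated above.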
HUMboldt Software. The software side uses standard socket programming libraries for TCP and Ethernet, while software kernels talking to other software kernels communicate through a mature software MPI library (MPICH). At runtime it parses the cluster description files to choose the right protocol: HUMboldt for hardware nodes and MPICH for software nodes. The sketch below illustrates how one source file can serve both targets.
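The claim that the exact same source code runs in hardware and software can be pictured as follows. This is only a sketch: the HUM_Send/HUM_Recv names, the SW_BUILD switch, and the humboldt_hls.h header are hypothetical, while MPI_Send and MPI_Recv are real MPICH calls.

    // Sketch of functional portability across software and hardware builds.
    #ifdef SW_BUILD
    #include <mpi.h>
    // On a software node, the (hypothetical) HUMboldt calls map onto MPICH.
    static void HUM_Send(void *buf, int count, int dest) {
        MPI_Send(buf, count, MPI_BYTE, dest, /*tag=*/0, MPI_COMM_WORLD);
    }
    static void HUM_Recv(void *buf, int count, int src) {
        MPI_Recv(buf, count, MPI_BYTE, src, /*tag=*/0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }
    #else
    #include "humboldt_hls.h"   // hypothetical HLS library providing the same calls
    #endif

    // The kernel body itself is identical for both targets.
    void kernel_main(int my_rank) {
        char msg[64] = "hello";
        if (my_rank == 0)
            HUM_Send(msg, sizeof(msg), /*dest=*/1);   // blocking send
        else
            HUM_Recv(msg, sizeof(msg), /*src=*/0);    // blocking receive
    }

In the software branch this could be compiled with mpicxx; in the hardware branch the same kernel_main would be synthesized with HLS, which is the portability argument made on the slide.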
System Tool Flow. HUMboldt has two branches for creating the entire cluster, one for software kernels and one for hardware kernels, and the same code can be used for both.
System Tool Flow. The cluster description file is what makes the flow easy to use: changing the underlying protocol, or moving a kernel from software to hardware, only requires editing a few fields of the file.

    <cluster>
      <node>
        <type> sw </type>
        <kernel> 0 </kernel>
        <mac_addr> ac:c4:7a:88:c0:47 </mac_addr>
        <ip_addr> 10.1.2.152 </ip_addr>
      </node>
      <node>
        <appBridge>
          <name> Humboldt_bridge </name>
        </appBridge>
        <board> adm-8k5-debug </board>
        <type> hw </type>
        <comm> eth </comm>
        <kernel> 1 </kernel>
        . . .
        <kernel> 16 </kernel>
        <mac_addr> fa:16:3e:55:ca:02 </mac_addr>
        <ip_addr> 10.1.2.101 </ip_addr>
      </node>
    </cluster>
Outline: Goals, Contributions, Previous Works, Galapagos, HUMboldt, Results, Conclusion, Future Work.
Results: Testbed. The testbed is a cluster of Intel Xeon E5-2650 CPUs (2.20 GHz, 12 physical cores, 24 threads) and Alpha Data ADM-PCIE-8K5 boards with Xilinx KU115 UltraScale devices.
Galapagos/HUMboldt Resource Utilization

    Abstraction Layer IP                 LUTs     Flip-Flops   BRAMs
    I)   Hypervisor                      14.4 %   9.1 %        11.8 %
    II)  Network Bridge (TCP)            4.4 %    2.4 %        0.1 %
    III) Network Bridge (Ethernet)       0.1 %    0.1 %        0.1 %
    IV)  HUMboldt Bridge                 0.1 %    0.1 %        0.05 %
    V)   Router                          0.8 %    0.5 %        0.05 %
    Total TCP (I + II + IV + V)          19.7 %   12.1 %       15.9 %
    Total Ethernet (I + III + IV + V)    15.3 %   9.7 %        12.0 %
Results: Microbenchmarks. [Diagram of the measured configurations: hardware kernel to hardware kernel on the same FPGA (HUMboldt), hardware kernel to hardware kernel on different FPGAs (HUMboldt), hardware kernel to software kernel and software kernel to hardware kernel between an FPGA and a CPU (HUMboldt), and software kernel to software kernel between CPUs (MPICH).]
Results: Throughput. [Throughput plots for the Ethernet and TCP configurations.]
Results: Latency. Latencies are measured with zero-payload packets. Relevant comparison: Microsoft Catapult v2, with 40G Ethernet and a lightweight transport layer, reports a 2.88 µs round trip.

    Microbenchmark                            Ethernet (µs)   TCP (µs)
    Hardware to hardware (same node)          0.2             0.2
    Hardware to hardware (different node)     5.7             15.2
    Software to hardware                      27.5            48.8
    Hardware to software                      34.7            113.6