
Exploring Cloud Infrastructure for Apps with Hardware Thesis Defense
"Discover the potential of re-programmable hardware in mobile phones, allowing apps to utilize hardware functionality. Gain insights into new hardware capabilities and improved performance, with case studies showcasing FPGA implementations. Learn about the industry shift towards System on Chips with FPGAs and ARM Cortex A. Thesis defense details included."
Presentation Transcript
Cloud RTR: Cloud Infrastructure for Apps with Hardware
Thesis Defense, April 16th, 2015
Ali Yasser Ismail
3/18/2025 1 Electrical, Computer and Energy Engineering Department

Goal

Apps with Hardware (Mbware)
- Give mobile phones re-programmable hardware (FPGA)
- Allow applications to utilize hardware
- Give developers a way to create and deploy their own hardware
[Diagram labels: The App Store, App + HW, Smart Phone]

Benefits

What do we gain?
- New Hardware Functionality
  o Software Defined Radio
  o Digital Signal Processing Hardware (FFT and FIR)
  o Cryptography Processing Hardware (AES and RSA)
- Better Performance
  o Microsoft accelerates datacenters with FPGA (Project Catapult) (Intel in talks to acquire Altera)
- Power Savings

Case Study 1
- Possa et al. [2]
  o Implemented a 15th order FIR filter
  o Compared FIR software implementation to hardware implementation
  o Used Nios II 32-bit RISC embedded processor and Altera Cyclone III FPGA
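
For reference, the software side of such a comparison can be as small as the direct-form loop below. This is a minimal sketch, not the code from Possa et al.: the function name `fir_filter` and the moving-average coefficients are illustrative assumptions. The only fact taken from the slide is the filter order; a 15th order FIR filter has 16 taps.

```python
def fir_filter(samples, coeffs):
    """Direct-form FIR: y[n] = sum over k of coeffs[k] * x[n - k], with x[n] = 0 for n < 0."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out


ORDER = 15                      # from the slide
TAPS = ORDER + 1                # a filter of order N has N + 1 coefficients
coeffs = [1.0 / TAPS] * TAPS    # hypothetical coefficients: 16-tap moving average

# Feeding a unit impulse returns the coefficients themselves, then zeros.
impulse = [1.0] + [0.0] * 19
response = fir_filter(impulse, coeffs)
print(response)
```

The inner loop performs one multiply-accumulate per tap, one after another. An FPGA implementation can evaluate those multiply-accumulates in parallel, which is the kind of difference a software versus hardware comparison measures.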

Case Study 2
- Che et al. [4]
  o Studied performance of GPU vs FPGA
  o Implemented 3 applications
    - Gaussian Elimination (Linear Algebra)
    - DES (64-bit block: FPGA took 83 cycles and GPU took 5.80 x 10^5 cycles)
    - Needleman-Wunsch (DNA Sequencing)
  o Hardware used
    - Intel Xeon Processor (3.2 GHz, 4 Threads)
    - NVIDIA GeForce 8800 GTX (128 stream processors at 575 MHz)
    - Xilinx Virtex-II Pro (100 MHz)
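
Cycle counts alone do not give wall-clock time, because the two devices run at different clock rates. The sketch below converts the DES figures to time. It assumes each cycle count is measured at the clock listed for that device on the slide (100 MHz for the FPGA, 575 MHz for the GPU); the slide does not state this, and the helper name `cycles_to_seconds` is illustrative.

```python
def cycles_to_seconds(cycles, clock_hz):
    """Convert a cycle count to seconds at a given clock frequency."""
    return cycles / clock_hz


# Figures from the slide; pairing each count with these clocks is an assumption.
fpga_s = cycles_to_seconds(83, 100e6)        # Xilinx Virtex-II Pro at 100 MHz
gpu_s = cycles_to_seconds(5.80e5, 575e6)     # GeForce 8800 GTX at 575 MHz
ratio = gpu_s / fpga_s

print(f"FPGA: {fpga_s * 1e6:.2f} us")
print(f"GPU:  {gpu_s * 1e6:.2f} us")
print(f"GPU time / FPGA time: {ratio:.0f}x")
```

Under that assumption the FPGA needs a little under a microsecond per 64-bit block and the GPU about a millisecond, a gap of roughly three orders of magnitude even though the GPU clock is faster.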

Case Study 2 (Cont.)

FPGAs in cell phones!
- Industry moving towards SoCs
  o Off-the-shelf systems with FPGAs and ARM Cortex A9 coupled together on the same chip
- High Level Synthesis (HLS) tools gaining momentum
  o Built-in Libraries
  o Simple IP generation process
  o Test Environment
- Xilinx supports run-time reconfiguration
  o Partial Reconfiguration Technology [5]

Challenges

Run-time Reconfiguration Approaches

Option 1: Reconfigure the Entire FPGA Design
- Re-define and compile an FPGA design
- Deploy
- Advantage
  o Simple
- Disadvantages
  o Disrupts other running hardware
  o Must compile and then deploy (Expensive)
  o Slow

Option 2: Xilinx Partial Reconfiguration
- Hardware Developer wants to support design variants
  o Not enough space
  o Flexibility
- Hardware Developer Design
  o FIR Filter
  o Swap 15 tap filter with 20 tap filter
- Xilinx Design Flow
  o Define partitions in the FPGA
  o Define Reconfigurable Modules (Variants)
  o Place RMs into each partition, place & route, generate full and partial bitstreams

Option 2: Xilinx Partial Reconfiguration
- Advantages
  o Supported in Xilinx mainstream tools (Vivado)
  o Floor planning is done by hand, thus, area is used optimally
- Disadvantages
  o Not general, everything known at compile time
  o Everything dictated by the developer, thus, does not support multiple party involvement

But this is what we want!
- Support for multiple parties
  o Phone Manufacturers (Static Design)
  o Developers (Reconfigurable Modules)
  o User (End-System)
- Phone manufacturer and developer do not have control of the end system
- Need a general deployment model!
[Diagram labels: Phone Manufacturers, Developers, User]

Option 3: General RTR (Academic)
- Two main approaches to obtain general RTR
  o Run-time place & route
  o Pre-defined slots
- Many variations of the two approaches in decades of work

Run-time Place & Route

Run-time Place & Routing
- Developer provides source design
- End system handles place & route at run-time to integrate into the design (more general)
- General Limitations
  o Tied to specific architectures
  o Not adopted in mainstream tools
  o Forced to reverse engineer bitstreams
  o Inefficient (slow, doesn't always work, poor performance)

Run-Time Routing Related Work
- JBits API into Xilinx bitstream [11]
  o API to access Xilinx FPGA bitstream
  o Access to LUTs, routing, Flip-Flops
- JRoute [12]
  o API that routes Xilinx FPGA devices
  o Routing at run-time
- Slotless Module-Based Reconfiguration [13]
  o Compile-time vs Run-time
  o Allocate large sandbox (Place Modules Optimally)
  o BitShop

Slot-based Solutions

Slot-based Run-time Reconfiguration
- Pre-placed slots in the FPGA
- Download configuration bits to the slots
- General Limitations
  o Forced to reverse engineer bitstreams
  o Force heavy constraints on routing

Slot-based Related Work
- Horta et al. [14] 2002
  o Dynamic Hardware Plugins (Slots)
  o PARBIT
    - Original bitfile
    - Target bitfile
    - Configuration Parameters
  o Routing constraints happen during static design bitstream generation

Our New Approach

Our Approach
- Use Xilinx mainstream tools to create a general RTR solution
- How?
  o Leverage deployment model of apps

Cloud RTR Approach
[Diagram labels: Manufacturer, Developer, Consumer, The Cloud Compiler, Google Play Store, App1 + partial bitstream1, App2 + partial bitstream2, App3 + partial bitstream3, App4 + partial bitstream4]
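
One way to picture this deployment model is as a catalog that the cloud compiler fills in and the phone reads from. The sketch below is only an interpretation of the diagram: it assumes the cloud compiler builds one partial bitstream for every slot of every phone variant, and every name in it (`CloudCompiler`, `register_variant`, `submit_app`, `fetch`, the file-name pattern) is hypothetical. No real compilation takes place.

```python
from dataclasses import dataclass, field


@dataclass
class CloudCompiler:
    """Hypothetical stand-in for the cloud service; it only records file names."""
    # phone variant -> number of slots in the manufacturer's static design
    variants: dict = field(default_factory=dict)
    # (app, variant, slot) -> partial bitstream file name
    catalog: dict = field(default_factory=dict)

    def register_variant(self, variant, num_slots):
        """Manufacturer side: publish a static design with a fixed number of slots."""
        self.variants[variant] = num_slots

    def submit_app(self, app):
        """Developer side: assumed brute force, one partial bitstream per slot per variant."""
        for variant, num_slots in self.variants.items():
            for slot in range(num_slots):
                self.catalog[(app, variant, slot)] = f"{app}_{variant}_slot{slot}.bit"

    def fetch(self, app, variant, slot):
        """Consumer side: the bitstream a phone would download for a free slot."""
        return self.catalog[(app, variant, slot)]


cloud = CloudCompiler()
cloud.register_variant("phoneA", 8)
cloud.register_variant("phoneB", 4)
cloud.submit_app("fft_app")

bitstream = cloud.fetch("fft_app", "phoneA", 3)
print(len(cloud.catalog), bitstream)
```

The catalog grows with the product of apps, variants, and slots, which is why the compile times measured in the evaluation matter for whether such a scheme is practical.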

Our Contributions

Our Contributions
- Worked out nuances of Xilinx partial reconfiguration [5]
  o Developed tools to automate and simplify this process
- Built cloud-based system and did an evaluation of the brute-force method

Partial Reconfiguration

Background Terms
- Synthesis
  o HDL into a netlist (connecting gates or flip-flops)
- Place & Route (Implementation)
  o Process to place and map netlist to FPGA physical resources
- Partial Reconfiguration (PR)
- Reconfigurable Module (RM)
- Processing System (PS)
  o ARM Cortex A9
- Programmable Logic (PL)
  o FPGA (Fabric)
- Out-of-context (OOC) Synthesis
  o Separate netlist for module
  o No optimizations

Define Static Design
- Hardware needed by RM
  o DMA
  o Reset
- Hardware I/O
[Block diagram labels: PS, RM, DMA, Reset, GPU, HDMI, Interconnect, AXI]

PR Clock Boundary
- Placing partitions on clock was not enough
[Diagram labels: FPGA, P (partitions), Reset, CPU, RM, AXI]

Static Design Synthesis
- Before any PR can be done we need to instantiate black boxes in HDL static design
- Need to make sure interfaces match
[Diagram labels: Static Design (HDL), Static Design (Gates)]

Static Design Synthesis
- HDL generated by Vivado from block design
- Script to instantiate black boxes

Base PR Tool Flow
1. Configure black boxes to be reconfigurable
2. Load a design into each partition*
3. Floorplan each block on a clock boundary
4. Implement design (Place & Route)
5. Carve out partitions
6. Save base checkpoint
* Each design must be synthesized OOC so they can be loaded in the partition
[Diagram labels: Static Design (Gates), Base Implementation Checkpoint, Static Routed Checkpoint]

Dynamic PR Tool Flow
1. Lock Design
2. Load RM into each partition
3. Implement design (Place & Route)
4. Verify
5. Generate Bitstreams/Binaries
* Each design must be synthesized OOC so they can be loaded in the partition
[Diagram labels: Static Routed Checkpoint, Dynamic Implementation Checkpoint, Base Implementation Checkpoint, Verify (PASS), Full Bitstream, Partial Bitstream1, Partial Bitstream2, Partial Bitstream3]
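
The five steps of the dynamic flow map fairly directly onto Vivado Tcl commands, so a small generator can emit a script for them. The sketch below is an illustration under assumptions, not the tooling developed for the thesis: the function name `dynamic_pr_script`, the checkpoint file names, and the partition cell names are hypothetical. The Tcl commands themselves (`open_checkpoint`, `lock_design`, `read_checkpoint -cell`, `opt_design`, `place_design`, `route_design`, `write_checkpoint`, `pr_verify`, `write_bitstream`) are standard Vivado commands.

```python
def dynamic_pr_script(static_dcp, base_dcp, rm_by_cell, out_name):
    """Return a Vivado Tcl script (as text) for one pass of the dynamic PR flow.

    static_dcp : static routed checkpoint with empty partitions
    base_dcp   : base implementation checkpoint to verify against
    rm_by_cell : mapping of partition cell name -> OOC-synthesized RM checkpoint
    out_name   : prefix for the routed checkpoint and the bitstreams
    """
    lines = [
        f"open_checkpoint {static_dcp}",
        "lock_design -level routing",                          # step 1: lock design
    ]
    for cell, rm_dcp in rm_by_cell.items():                    # step 2: load RMs
        lines.append(f"read_checkpoint -cell {cell} {rm_dcp}")
    lines += ["opt_design", "place_design", "route_design"]    # step 3: implement
    lines.append(f"write_checkpoint -force {out_name}_routed.dcp")
    lines.append(f"pr_verify {base_dcp} {out_name}_routed.dcp")  # step 4: verify
    lines.append(f"write_bitstream -force {out_name}")         # step 5: bitstreams
    return "\n".join(lines)


# Hypothetical file and cell names, for illustration only.
script = dynamic_pr_script(
    "static_routed.dcp",
    "base_routed.dcp",
    {"slot0": "fft_rm.dcp", "slot1": "fir_rm.dcp"},
    "config1",
)
print(script)
```

The generated script runs `write_bitstream` unconditionally. A production flow would inspect the result of `pr_verify` and only generate bitstreams when verification passes, as the PASS gate in the diagram indicates.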

Evaluation

Experiments
- Experiment 1: Slot Placement
  o Place max # of FFTs
  o 1024 samples/frame
  o Single channel
  o 16-bit data width
- How many slots will be available to developer/user?
- Experiment 2: Base PR execution times
  o 2-8 FFT slots
- Is the brute force method practical?
- Experiment 3: Dynamic PR execution times
  o 2-8 FFT slots

Experiment 1

How many slots?
- Slot Placement
  o Place max # of FFTs
  o 1024 samples/frame
  o Single channel
  o 16-bit data width
- Resources Available
  o Slices
  o BRAM
  o DSPs
- Result
  o Placed 8 FFTs

FPGA Layout

FFT Slot Layout

Slot Design Layout Considerations
- What did we learn?
  o Best to allocate all three resources to slots
  o DSPs and/or Slices will likely be the bottleneck
  o Potentially reduce number of slots and allocate more space
  o Hardware support will grow with RMs (fewer slots means more space can be used)

Hardware    Most Used Resource    % Utilization
FFT IP      LUTs                  6.12
Reset IP    LUTs                  0.07
DMA IP      LUTs                  2.21
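
The utilization table supports a rough LUT budget per slot. The sketch below assumes that every slot carries its own Reset IP and DMA IP alongside the FFT, and that the percentages are fractions of the whole device; the slides say support hardware grows with RMs but do not spell out either point, so the totals are an estimate only. The function name `lut_budget` is illustrative.

```python
# LUT utilization per block, in percent, taken from the table on the slide.
LUT_UTILIZATION = {"FFT IP": 6.12, "Reset IP": 0.07, "DMA IP": 2.21}


def lut_budget(num_slots):
    """Return (percent per slot, percent for all slots).

    Assumes each slot needs one FFT, one Reset IP and one DMA IP.
    """
    per_slot = sum(LUT_UTILIZATION.values())
    return per_slot, per_slot * num_slots


per_slot, total = lut_budget(8)
support = LUT_UTILIZATION["Reset IP"] + LUT_UTILIZATION["DMA IP"]

print(f"per slot: {per_slot:.2f}% of LUTs")
print(f"8 slots:  {total:.2f}% of LUTs")
print(f"support hardware share of a slot: {support / per_slot:.0%}")
```

This counts LUTs only. The slides name DSPs and Slices as the likely bottleneck, so the LUT figure should be read as one constraint among several rather than as the limit on the slot count.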

Experiment 2

Base PR Execution Time
- Intel Xeon 2.1 GHz (6 cores with 48 GB RAM)
- 2-8 Slots
- FFT
  o 1024 samples/frame
  o Single channel
  o 16-bit data width
- Note: This process is done once for each phone variant (or when manufacturer updates static design)

Base PR Execution Time
[Chart: Execution Time (m) vs. # of Slots]

# of Slots    Execution Time (m)
2             6.02
3             9.18
4             12.25
5             16.57
6             22.02
7             27.70
8             36.17
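
A quick way to read the chart is to look at the time per slot and the extra time each added slot costs. The sketch below assumes the seven data labels on the chart correspond, in increasing order, to 2 through 8 slots; the variable names are illustrative.

```python
# Base PR execution time in minutes, keyed by number of slots (chart data labels).
base_minutes = {2: 6.02, 3: 9.18, 4: 12.25, 5: 16.57, 6: 22.02, 7: 27.70, 8: 36.17}

slots = sorted(base_minutes)

# Average minutes per slot at each configuration.
per_slot = {n: base_minutes[n] / n for n in slots}

# Extra minutes incurred by going from n - 1 slots to n slots.
increments = {b: base_minutes[b] - base_minutes[a] for a, b in zip(slots, slots[1:])}

for n in slots:
    extra = f"{increments[n]:.2f}" if n in increments else "n/a"
    print(f"{n} slots: {per_slot[n]:.2f} min/slot, +{extra} min over previous")
```

The per-slot cost rises from about 3 minutes at 2 slots to about 4.5 minutes at 8 slots, so compile time grows faster than linearly with the number of slots on this device.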

Slot Design Layout Considerations
- What did we learn?
  o More area means more compilation time
- Placement of slots affects compilation time
  o Solution: Use a bigger FPGA or less FPGA area

Experiment 3

Dynamic PR Execution Time
- Intel Xeon 2.1 GHz (6 cores with 48 GB RAM)
- 2-8 Slots
- FFT
  o 1024 samples/frame
  o Single channel
  o 16-bit data width

Dynamic PR Execution Time
[Chart: Execution Time (m) vs. # of Slots]

# of Slots    Execution Time (m)
2             8.21
3             11.22
4             15.31
5             20.50
6             24.52
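
Comparing the two charts shows how much longer the dynamic flow takes than the base flow at the same slot count. The sketch below assumes the data labels on each chart map, in increasing order, to the slot counts on its axis (2 through 6 for the dynamic chart); the variable names are illustrative.

```python
# Execution times in minutes, keyed by number of slots (chart data labels).
base_minutes = {2: 6.02, 3: 9.18, 4: 12.25, 5: 16.57, 6: 22.02}
dynamic_minutes = {2: 8.21, 3: 11.22, 4: 15.31, 5: 20.50, 6: 24.52}

# Ratio of dynamic PR time to base PR time for each slot count.
ratio = {n: dynamic_minutes[n] / base_minutes[n] for n in sorted(dynamic_minutes)}

for n, r in ratio.items():
    print(f"{n} slots: dynamic / base = {r:.2f}")
```

On these numbers the dynamic flow is slower than the base flow at every measured slot count. Since the base flow runs once per phone variant, the dynamic flow is the cost that recurs, which makes it the more important figure for judging the brute-force method.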

Slot Design Layout Considerations
- Placement of slots affects compilation time; the effect is more prevalent in the dynamic PR flow
- Solution: Use a bigger FPGA or less FPGA area