Liquid Metal and Lime: Exploring High-Performance Streaming Hardware Synthesis

Liquid Metal and Lime enable efficient streaming hardware synthesis, combining dynamic application-specific customization, JIT compilation of hardware, and high-level programming abstractions to address power demands and exploit parallelism. These advances aim to meet the growing need for versatile, reconfigurable computing in the era of pervasive streaming and mobile content.

  • Liquid Metal
  • Lime
  • Streaming Hardware
  • JIT Compilation
  • Parallelism


Presentation Transcript


  1. Liquid Metal / OPTIMUS: Synthesis of Efficient Streaming Hardware. Scott Mahlke (University of Michigan) and Rodric Rabbah (IBM). DAC HLS Tutorial, San Francisco, 2009.

  2. Dynamic Application-Specific Customization of HW. [Diagram: a JIT compiler configures FPGA logic around the slot labeled "your app here".] Inspired by the ASIC paradigm: high performance, low power.

  3. Liquid Metal: JIT the Hardware. A single language for programming HW and SW: run in a standard JVM, or synthesize to HW, and fluidly move computation between HW and SW. Do for HW (viz. FPGAs) what FORTRAN did for computing. Address critical technology trends:
     • Power (ASIC-like): address the impractical growth of power and cooling demands.
     • Architecture: enable million-way parallelism vs. small-scale multicores.
     • Versatility: in-the-field and on-the-fly customization to end-user applications.
     • Reconfigurable applications: the demand for pervasive streaming and mobile content (WWW, multimedia, gaming).

  4. Lime: the Liquid Metal Language. Design principles:
     • Object-oriented: Java-like and Java-compatible.
     • Raise the level of abstraction.
     • Parallel constructs that simplify code.
     • Target synthesis while retaining generality.

  5. Four reasons this is not another *C-to-HDL approach:
     • Emphasis on programmer productivity: leverage rich Java IDEs, libraries, and analyses.
     • Not an auto-parallelization approach: Lime is explicitly parallel and synthesizable.
     • A fast fail-safe mechanism: Lime may be refined into a parallel SW implementation.
     • Intrinsic opportunity for online optimization: static optimizations with dynamic refinement.

  6. Lime Overview. Lime features, and what each buys in HW (FPGA):
     • Tasks and value types: computation is well encapsulated.
     • Streaming primitives: data-flow-driven computation.
     • Rate-matching operators: multiple clock domains.
     • Ordinal-indexed arrays, bounded loops, and template-like generics: memory usage statically determined before layout.
     • Bit-level control and reasoning: abstract OO programming down to the bit level!

  7. Streams: Exposing Computational Structure. Stream primitives are integral to the language. Tasks in streams are strongly isolated: only the endpoints may perform side effects. This provides a macro-level functional programming abstraction while allowing traditional imperative programming inside each task.

  8. A Brief Introduction to Stream Operations.
     A finite stream literal:   int stream s1 = { 1, 1, 2, 3, 5, 8 };
     An infinite stream of 3s:  int stream s2 = task 3;
     Stream expressions:        int stream s3 = s2 * 17;
                                double stream s4 = Math.sin(s1);
                                double stream s5 = s3 + s4;
     These operations create and connect tasks; execution occurs later (lazy computation, functional).
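To make the "execution occurs later" point concrete, here is a minimal Java sketch (not the Lime runtime; IntTaskStream and its methods are invented for illustration) of how a stream expression like s2 * 17 can create and connect a task without running anything until the stream is pulled:

    // Hypothetical sketch: stream expressions build a task graph lazily.
    import java.util.function.IntSupplier;
    import java.util.function.IntUnaryOperator;

    final class IntTaskStream {
        private final IntSupplier source;

        private IntTaskStream(IntSupplier source) { this.source = source; }

        // "task 3": an infinite stream of a constant value.
        static IntTaskStream constant(int v) { return new IntTaskStream(() -> v); }

        // "s * 17": creates and connects a new task; nothing executes yet.
        IntTaskStream map(IntUnaryOperator f) {
            return new IntTaskStream(() -> f.applyAsInt(source.getAsInt()));
        }

        // Execution happens only when the stream is actually pulled.
        int next() { return source.getAsInt(); }

        public static void main(String[] args) {
            IntTaskStream s2 = IntTaskStream.constant(3);
            IntTaskStream s3 = s2.map(x -> x * 17); // graph built, not executed
            System.out.println(s3.next());          // first pull: prints 51
        }
    }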

  9. Simple Audio Processing.
     value int[] squareWave(int freq, int rate, int amplitude) {
       int wavelength = rate / freq;
       int[] samples = new int[wavelength];
       for (int s : 1::wavelength)
         samples[s] = (s <= wavelength/2) ? 0 : amplitude;
       return (value int[]) samples;
     }
     int stream sqwaves = task squareWave(1000, 44100, 80);
     task AudioSink(44100).play(sqwaves);

  10. Liquid Metal Tool Chain. Lime source enters the Quicklime front-end compiler, which lowers it to a Streaming IR annotated with an FPGA model. Two back-end compilers consume the IR: the Optimus back-end emits HDL, which the Xilinx VHDL compiler turns into a Xilinx bitfile for a Virtex-5 FPGA; the Crucible back-end emits C, which the Cell SDK turns into a Cell binary for the Cell BE. The LM VM manages execution on each target.

  11. Streaming Intermediate Representation (SIR). The SIR graph constructs are Task, Pipeline, SplitJoin (a splitter fanning out to parallel branches that merge at a joiner), Switch (a switch feeding a joiner), and Feedback Loop (a joiner and splitter connected in a cycle). A task may be stateless or have state; each task maps to a module with FIFO I/O. Task graphs are hierarchical and structured.
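To make the hierarchy concrete, a hypothetical Java sketch of such a structured stream IR follows; the class names and shapes are invented for illustration and are not the actual Liquid Metal data structures:

    import java.util.List;

    // Every SIR construct is a node; graphs nest, so they stay structured.
    sealed interface SirNode permits Task, Pipeline, SplitJoin, FeedbackLoop {}

    // A leaf task: maps to one HW module with FIFO input/output.
    record Task(String name, boolean stateful) implements SirNode {}

    // Stages run in sequence, connected by FIFO channels.
    record Pipeline(List<SirNode> stages) implements SirNode {}

    // A splitter fans data out to parallel branches; a joiner merges them.
    record SplitJoin(List<SirNode> branches) implements SirNode {}

    // A body whose output feeds back through a joiner to its own input.
    record FeedbackLoop(SirNode body, SirNode loopback) implements SirNode {}

    class SirDemo {
        public static void main(String[] args) {
            // Hierarchical and structured: a pipeline containing a split-join.
            SirNode graph = new Pipeline(List.of(
                new Task("Source", false),
                new SplitJoin(List.of(
                    new Task("FilterA", true),
                    new Task("FilterB", true))),
                new Task("Sink", false)));
            System.out.println(graph);
        }
    }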

  12. SIR Compiler Optimizations.
     • Task fusion and fission: load balancing, scalability.
     • Stream buffer allocation: locality-enhancing; manages cache footprint, or SRAM and control-logic complexity.
     • Data access fusion: reduces critical path length, improves the communication-to-computation balance.
     These address the challenges of FPGA compilation: a finite, non-virtualizable device; a complex optimization space (throughput, latency, power, area); and very long synthesis times (minutes to hours).

  13. Preliminary Liquid Metal Results on Energy Consumption: FPGA vs. PPC 405. [Chart: fraction of PowerPC energy consumed by the FPGA implementations of DES, Bubble Sort, FFT, Matrix Multiply, Discrete Cosine Transform, Merge Sort, Matrix Block Multiply, and Parallel Adder; some bars are labeled 2.25 and ~1.4.] Liquid Metal on a Virtex-4 FPGA (1.6 W); C reference implementation on a PPC 405 (0.5 W).

  14. Preliminary Liquid Metal Results on Parallelism: FPGA vs. PPC 405. [Chart.] Liquid Metal on a Virtex-4 FPGA (1.6 W); C reference implementation on a PPC 405 (0.5 W).

  15. Handel-C Comparison. Compared DES and DCT against hand-optimized Handel-C implementations. Performance: 5% faster before optimizations, 12x faster after optimizations. Area: 66% larger before optimizations, 90% larger after optimizations.

  16. Overview: Compilation Flow, Scheduling, Optimizations, Results.

  17. Top Level Compilation. [Diagram: a stream graph of filters (Source, A through J, Sink) connected through round-robin splitters (8,8,8,8) and round-robin joiners (1,1,1,1); each filter compiles to a hardware module (M0 through Mn) containing a controller and a work unit, plus init logic, local arrays, and FIFO inputs (i0..ix) and outputs (O0..Om).]

  18. Filter Compilation. [Diagram: a filter that sums eight popped values compiles into four basic blocks connected by control tokens and acknowledgments, with registers for live-in and live-out data and memory/queue ports: bb1 initializes sum = 0 and i = 0; bb2 performs a FIFO read, temp = pop(); bb3 computes sum = sum + temp and i = i + 1, branching back to bb2 if i < 8; bb4 performs a FIFO write, push(sum).]
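For reference, the logic those four basic blocks implement corresponds to the following filter, shown as a plain Java sketch with blocking queues standing in for the FIFO ports (the generated hardware uses registers and control tokens, not threads or queues):

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    class SumFilter {
        static void runOnce(BlockingQueue<Integer> in,
                            BlockingQueue<Integer> out) throws InterruptedException {
            int sum = 0;                       // bb1: sum = 0, i = 0
            for (int i = 0; i < 8; i++) {
                int temp = in.take();          // bb2: FIFO read, temp = pop()
                sum = sum + temp;              // bb3: accumulate; branch to bb2 if i < 8
            }
            out.put(sum);                      // bb4: FIFO write, push(sum)
        }

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Integer> in = new ArrayBlockingQueue<>(8);
            BlockingQueue<Integer> out = new ArrayBlockingQueue<>(1);
            for (int v = 1; v <= 8; v++) in.put(v);
            runOnce(in, out);
            System.out.println(out.take());    // prints 36
        }
    }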

  19. Operation Compilation. [Diagram: the statements sum = sum + temp, i = i + 1, and "branch bb2 if i < 8" map onto functional units: two adders fed from the temp, sum, and i registers, and a comparator against the constant 8 producing the branch predicate; each functional unit has registered inputs (i0..im) and outputs (o0..on) plus control-in and control-out signals.]

  20. Static Stream Scheduling. Each queue has to be deep enough to hold the values generated by a single execution of the connected filter, so double buffering is needed and buffer access is non-blocking. A controller module orchestrates the schedule, using a finite state machine to execute the steady-state schedule. [Diagram: Filter 1 pushes 2 items per firing; Filter 2 pops 3.]
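A minimal sketch of the steady-state schedule such a controller steps through, assuming the rates in the diagram (Filter 1 pushes 2 per firing, Filter 2 pops 3): since lcm(2, 3) = 6, one steady-state iteration fires Filter 1 three times and then Filter 2 twice, and every buffer access is non-blocking because the buffer is sized for the whole iteration:

    class StaticScheduleController {
        public static void main(String[] args) {
            int[] buffer = new int[6];       // holds one full steady-state iteration
            int head = 0, tail = 0, value = 0;

            for (int f = 0; f < 3; f++) {    // FSM state: fire Filter 1
                buffer[tail++] = value++;    // push 2; non-blocking, space guaranteed
                buffer[tail++] = value++;
            }
            for (int f = 0; f < 2; f++) {    // FSM state: fire Filter 2
                int sum = buffer[head++] + buffer[head++] + buffer[head++]; // pop 3
                System.out.println("Filter 2 fired, sum = " + sum);
            }
        }
    }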

  21. Greedy Stream Scheduling. Filters fire eagerly, with blocking channel access. This allows potentially smaller channels, needs no controller, and produces results with lower latency.
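The contrast with static scheduling is easy to see in code. In this minimal Java sketch, each filter is a thread that fires eagerly against a small blocking channel; the thread-per-filter structure is illustrative only, not the generated hardware:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    class GreedyScheduleDemo {
        public static void main(String[] args) {
            // A channel smaller than one firing's worth of data still works,
            // because a full channel simply blocks the producer.
            BlockingQueue<Integer> channel = new ArrayBlockingQueue<>(2);

            Thread filter1 = new Thread(() -> {   // fires whenever it can push
                try {
                    for (int v = 0; v < 6; v++) channel.put(v); // blocks when full
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            Thread filter2 = new Thread(() -> {   // fires as soon as data arrives
                try {
                    for (int i = 0; i < 2; i++) {
                        int sum = channel.take() + channel.take() + channel.take();
                        System.out.println("Filter 2 fired, sum = " + sum);
                    }
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            filter1.start();
            filter2.start();
        }
    }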

  22. Latency Comparison. [Chart: latency of static scheduling relative to greedy (0 to 18x) for DES, FFT, Matrix Multiply, Bubble Sort, Parallel Adder, Merge Sort, Matrix Block Multiply, Discrete Cosine Transform, and the average.]

  23. Area Comparison. [Chart: percentage of FPGA area (0 to 100%) for circuits with the static scheduler vs. circuits with the greedy scheduler, across DES, FFT, Matrix Multiply, Bubble Sort, Merge Sort, Parallel Adder, Matrix Block Multiply, Discrete Cosine Transform, and the average.]

  24. Optimizations. Streaming optimizations (macro-functional): channel allocation, channel access fusion, critical path balancing, filter fission and fusion, etc. These need global information about the stream graph and are typically performed manually using existing tools. Classic optimizations (micro-functional): flip-flop elimination, common subexpression elimination, constant folding, loop unrolling, etc. These are typically included in existing compilers and tools.

  25. Channel Allocation. Larger channels mean more SRAM and more control logic, but fewer stalls. Interlocking makes sure that each filter gets the right data or blocks. What is the right channel size?

  26. Channel Allocation Algorithm. Set the size of the channels to infinity; warm up the queues; record the steady-state instruction schedules for each producer/consumer pair; unroll the schedules so they have the same number of pushes and pops; find the maximum number of overlapping value lifetimes. That maximum is the channel size actually needed.
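A small Java sketch of the final step, assuming the recorded, unrolled schedules arrive as aligned event sequences (the string-array input format is invented for illustration); the peak number of simultaneously live values is the channel size:

    class ChannelAllocation {
        // producer[t] and consumer[t] are the events at schedule step t.
        static int maxOverlap(String[] producer, String[] consumer) {
            int live = 0, max = 0;
            for (int t = 0; t < producer.length; t++) {
                if (producer[t].equals("push")) live++;   // a value becomes live
                if (consumer[t].equals("pop"))  live--;   // a value dies
                max = Math.max(max, live);
            }
            return max;
        }

        public static void main(String[] args) {
            // Unrolled schedules with equal numbers of pushes and pops.
            String[] producer = { "push", "push", "push", "----", "----", "----" };
            String[] consumer = { "----", "----", "----", "pop",  "pop",  "pop"  };
            System.out.println("channel size = " + maxOverlap(producer, consumer)); // 3
        }
    }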

  27. Channel Allocation Example. [Diagram: a pipeline of Source, Filter 1, Filter 2, and Sink, with the interleaved producer/consumer push/pop schedules for each channel laid out step by step; counting the overlapping value lifetimes gives Max overlap = 3, so a three-slot channel suffices.]

  28. Channel Allocation. [Chart: channel size after optimization relative to the unoptimized size (0 to 0.8) for DES, FFT, Matrix Multiply, Bubble Sort, Matrix Block Multiply, Merge Sort, Parallel Adder, Discrete Cosine Transform, and the average.]

  29. Channel Access Fusion. Each channel access (push or pop) takes one cycle, which worsens the communication-to-computation ratio, lengthens the critical path, and limits task-level parallelism.

  30. Channel Access Fusion Algorithm. Cluster the channel access operations using loop unrolling and code motion, then balance the groups. The transformation is similar to vectorization and relies on wide channels. [Diagram: fusion changes a channel's write/read multiplicities, e.g., from write multiplicity 1 and read multiplicity 8 to write multiplicity 4 or 8 against read multiplicity 8 or 1.]

  31. Access Fusion Example. Original code:
     int sum = 0;
     for (int i = 0; i < 32; i++)
       sum += pop();
     push(sum);
     After unrolling the loop by 4:
     int sum = 0;
     for (int i = 0; i < 8; i++) {
       sum += pop();
       sum += pop();
       sum += pop();
       sum += pop();
     }
     push(sum);
     After fusing the four pops into one wide access:
     int sum = 0;
     int t1, t2, t3, t4;
     for (int i = 0; i < 8; i++) {
       (t1, t2, t3, t4) = pop4();
       sum += t1 + t2 + t3 + t4;
     }
     push(sum);
     Some caveats apply, e.g., stray accesses outside the loop, as in
     int sum = 0;
     for (int i = 0; i < 32; i++)
       sum += pop();
     pop();
     pop();
     push(sum);
     do not group evenly.

  32. Access Fusion. [Chart: speedup from access fusion (0 to 8x) for DES, Bubble Sort, FFT, Discrete Cosine Transform, Matrix Multiply, Merge Sort, Matrix Block Multiply, Parallel Adder, and the average.]

  33. Critical Path Balancing. The critical path is set by the longest combinational path in the filters; Optimus uses its internal FPGA model to estimate how this impacts throughput and latency. Balancing algorithm: Optimus takes the target clock as input; start with the fewest basic blocks; form USE/DEF chains for the filter; use the internal FPGA model to measure critical path latency; break the paths whose latency exceeds the target.
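As a toy illustration of the last step, the sketch below walks a straight-line chain of operations and records a cut (a new basic block boundary, i.e., a pipeline register) wherever the accumulated combinational delay would exceed the target clock. The per-operation delays follow the table on the next slide (Add/Sub = 4, Shift = 2, Multiply = 10); the chain itself and all names are invented:

    import java.util.List;
    import java.util.Map;

    class CriticalPathBalancer {
        static final Map<String, Integer> DELAY =
            Map.of("add", 4, "sub", 4, "shift", 2, "mul", 10);

        // Returns the chain indices after which a register is inserted.
        static List<Integer> placeRegisters(List<String> chain, int targetClock) {
            var cuts = new java.util.ArrayList<Integer>();
            int pathDelay = 0;
            for (int i = 0; i < chain.size(); i++) {
                int d = DELAY.get(chain.get(i));
                if (pathDelay + d > targetClock) {  // path too long: break it here
                    cuts.add(i - 1);
                    pathDelay = 0;
                }
                pathDelay += d;
            }
            return cuts;
        }

        public static void main(String[] args) {
            // add(4) mul(10) add(4) sub(4) shift(2) mul(10), target clock = 16
            List<String> chain = List.of("add", "mul", "add", "sub", "shift", "mul");
            System.out.println(placeRegisters(chain, 16)); // prints [1, 4]
        }
    }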

  34. Critical Path Balancing Example. [Diagram: a dataflow graph of adds, subtracts, multiplies, and shifts split into stages 1 through 4 so that no stage's combinational delay exceeds the target. Operation delay table: Add/Sub = 4, Shift = 2, Multiply = 10.]

  35. Liquid Metal. An interdisciplinary effort addressing the entire stack: one language for programming HW (FPGAs) and SW, and a Liquid Metal VM that JITs the hardware. [Diagram: program everything with Lime; the LiquidMetal VM runs it atop CPU, FPGA, GPU, multicore, and a slot marked "???" for future targets.]

  36. Streaming IR. Exposes structure: computation and communication. A uniform framework for pipeline and data parallelism, and a canonical representation for stream-aware optimizations.

  37. Streaming Optimizations. Macro-functional: fusion, fission, and replication; channel allocation; access fusion. Micro-functional: micro-pipelining; flip-flop elimination. Together these fold streaming IR graphs onto the FPGA subject to latency, area, and throughput constraints.

  38. Ongoing Effort. Application development: streaming for enterprise and consumer, real-time applications. Compiler and JIT: pre-provisioning profitable HW implementations, runtime opportunities to JIT the HW, and advanced dynamic-reconfiguration support in the VM (predictive, hides latency). New platforms: tightly coupled, higher-bandwidth, lower-latency communication; heterogeneous MPSoC systems (FPGA + processors).
