Implementing Matrix Multiplication in Data Flow Technology

Learn how Aleksandar Milinković from Belgrade University implemented matrix multiplication using data flow technology. Discover the shift from control flow to data flow, resolving dependencies at compile time, and achieving parallel computation with deep pipelines. Explore the applicability of data flow in fields such as bioinformatics, and the goal of realizing matrix multiplication on an FPGA.

  • Matrix Multiplication
  • Data Flow Technology
  • Belgrade University
  • Electrical Engineering
  • Parallel Computation


Presentation Transcript


  1. Matrix multiplication implemented in data flow technology. Aleksandar Milinković, Belgrade University, School of Electrical Engineering, amilinko@gmail.com

  2. Introduction
     • The problem of big data calls for a change of computing paradigm: data flow instead of control flow.
     • This is achieved by constructing a graph whose nodes (vertices) perform the computations.
     • Each node is one deep pipeline.

  3. Dataflow computation
     • Dependencies are resolved at compile time; no new dependencies arise at run time.
     • The whole mechanism is one deep pipeline, whose levels perform computations in parallel.
     • The data flow produces one result per cycle.
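The throughput claim on this slide can be made concrete with a toy cost model (a sketch, not Maxeler code): once a deep pipeline of depth L is full, one result emerges every cycle, so n results cost roughly L + n cycles rather than n · L.

```python
# Toy model of deep-pipeline throughput (illustrative only, not MaxJ code).
# After the fill latency of `depth` cycles, one result emerges per cycle.

def pipeline_cycles(n_results, depth):
    """Cycles for a fully pipelined unit: fill once, then one result per cycle."""
    return depth + n_results - 1

def sequential_cycles(n_results, depth):
    """Cycles if each result had to traverse the whole unit alone."""
    return depth * n_results

n, depth = 1_000_000, 100
print(pipeline_cycles(n, depth))    # 1_000_099: close to one cycle per result
print(sequential_cycles(n, depth))  # 100_000_000: depth times slower
```

The depth value of 100 is an arbitrary illustration; the point is that for large n the fill latency becomes negligible.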

  4. Matrix multiplication
     • Data flow does not suit all situations, but it is applicable in many cases: partial differential equations, 3D finite differences, the finite element method, problems in bioinformatics, etc.
     • Most of these contain matrix multiplications.
     • Goal: realization of matrix multiplication on an FPGA, using data flow.

  5. Project realizations
     Two solutions:
     • Maximal utilization of the on-chip part of the matrix (covering matrices with small and with large dimensions).
     • Multiplication using parallel pipelines.

  6. Good chip utilization A
     [Diagram: matrices X and Y laid out element by element across Pipe 0 and Pipe 1, bounded by the FMem capacity.]
     • A set of columns stays on the chip until it is fully used.
     • Every pipe calculates 48 sums at a time.
     • Equivalent to 2 processors with 48 cores each; additional parallelization is possible.
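A software analogue of scheme A can help build intuition (a sketch under assumptions: the slide's FMem is modelled as a plain Python list, the operand names A and X are taken from the slide labels, and the second input is streamed row by row past a block of columns held "on chip"):

```python
# Illustrative software analogue of scheme A (not the actual FPGA design):
# keep a block of BLOCK columns of X in "fast memory" until fully used,
# stream the rows of A past it, and produce BLOCK sums per pass.

BLOCK = 48  # sums per pipe, as on the slide

def matmul_blocked(A, X, block=BLOCK):
    n = len(A)
    Y = [[0.0] * n for _ in range(n)]
    for j0 in range(0, n, block):
        # bring a block of columns of X "on chip"
        cols = [[X[k][j] for k in range(n)]
                for j in range(j0, min(j0 + block, n))]
        for i in range(n):                 # stream rows of A past the block
            for bj, col in enumerate(cols):
                Y[i][j0 + bj] = sum(A[i][k] * col[k] for k in range(n))
    return Y
```

Each column block is loaded exactly once and reused for all n rows, which is the reuse the slide's "until they are fully used" refers to.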

  7. Good chip utilization A

  8. Good chip utilization A
     Chip utilization:
     • LUTs:  195345/297600 (65.64%)
     • FFs:   290689/595200 (48.83%)
     • BRAMs: 778/1064 (73.12%)
     • DSPs:  996/2016 (49.40%)
     Acceleration (matrix 2304 x 2304): Intel 42.5 s, MAX3 2.38 s; at a kernel clock of 75 MHz the acceleration is about 18x.

  9. Good chip utilization B
     [Diagram: matrices X and Y laid out across Pipe 0 and Pipe 1, with a part of Y held within the FMem capacity.]
     • Part of matrix Y is on the chip during the computation.
     • Each pipe calculates 48 sums at a time.
     • Equivalent to 2 processors with 48 cores each.
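The difference from scheme A can again be sketched in software (an assumed reading of the slide: here a tile of the result Y stays in fast memory and is accumulated in place while the operands stream past, so partial sums never leave the chip; the names A, X, Y and the tile size are illustrative):

```python
# Illustrative software analogue of scheme B (not the actual FPGA design):
# a tile of result rows of Y is held "on chip" and accumulated in place
# while the operands are streamed through, one k-step at a time.

def matmul_y_on_chip(A, X, tile=48):
    n = len(A)
    Y = []
    for i0 in range(0, n, tile):
        rows = min(tile, n - i0)
        acc = [[0.0] * n for _ in range(rows)]  # tile of Y kept "on chip"
        for k in range(n):                      # stream operands past the tile
            xk = X[k]
            for di in range(rows):
                a = A[i0 + di][k]
                for j in range(n):
                    acc[di][j] += a * xk[j]
        Y.extend(acc)                           # finished tile written back
    return Y
```

Keeping the accumulators on chip avoids re-reading partial results from large memory, at the cost of the BRAM needed to hold the tile.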

  10. Good chip utilization B

  11. Good chip utilization B
     Chip utilization:
     • LUTs:  201237/297600 (67.62%)
     • FFs:   302742/595200 (50.86%)
     • BRAMs: 782/1064 (73.50%)
     • DSPs:  1021/2016 (50.64%)
     Acceleration: matrix 4608 x 4608: Intel 1034 s, MAX3 58.41 s; matrix 2304 x 2304: Intel 42.5 s, MAX3 2.38 s. At a kernel clock of 75 MHz the acceleration is about 18x.

  12. Multiple parallel pipelines
     [Diagram: 48 pipes (Pipe 0 through Pipe 47), each reading elements of X and producing elements of Y.]
     • The matrices reside exclusively in the big (off-chip) memory.
     • Each pipe calculates one sum at a time.
     • Equivalent to 48 processors with one core each.
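This third scheme can be sketched as 48 independent workers, each producing one output element (one dot product) per round (a software analogue under assumptions: the pipe count comes from the slide, the round-robin assignment of output elements to pipes is illustrative):

```python
# Illustrative analogue of the parallel-pipeline scheme (not the FPGA design):
# 48 independent pipes, each computing one sum (one output element) at a
# time; both operands live only in large off-chip memory.

N_PIPES = 48

def matmul_parallel_pipes(A, X):
    n = len(A)
    Y = [[0.0] * n for _ in range(n)]
    elems = [(i, j) for i in range(n) for j in range(n)]
    for start in range(0, len(elems), N_PIPES):
        # one "round": each pipe produces exactly one sum
        for i, j in elems[start:start + N_PIPES]:
            Y[i][j] = sum(A[i][k] * X[k][j] for k in range(n))
    return Y
```

With no on-chip reuse, every dot product re-reads a full row and column from large memory, which is why this variant trades speed for the ability to handle larger matrices.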

  13. Multiple parallel pipelines

  14. Multiple parallel pipelines
     Chip utilization:
     • LUTs:  166328/297600 (55.89%)
     • FFs:   248047/595200 (41.67%)
     • BRAMs: 430/1064 (40.41%)
     • DSPs:  489/2016 (24.26%)
     Acceleration: matrix 4608 x 4608: Intel 1034 s, MAX3 98.48 s; matrix 2304 x 2304: Intel 42.5 s, MAX3 4.08 s. At a kernel clock of 150 MHz the acceleration is more than 10x.

  15. Comparison of solutions
     • First solution: good chip utilization and shorter execution time; drawback: matrices only up to 8 GB.
     • Second solution: matrices up to 12 GB; drawback: longer execution time.

  16. Conclusions
     • Matrix multiplication is an operation with complexity O(n³).
     • Part of the complexity is moved from time to space, which produces the acceleration (shorter execution time).
     • This is achieved by applying data flow technology, developed using the tool chain from Maxeler Technologies.
     • The calculations are an order of magnitude faster than on an Intel Xeon.
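The O(n³) claim is easy to verify by counting operations in the textbook triple loop (a reference sketch; the "time to space" trade then means spending chip area on many multipliers so that a slice of these n³ multiply-adds completes every cycle):

```python
# Reference triple-loop multiplication with an explicit operation counter:
# for n x n matrices it performs exactly n**3 multiply-add steps.

def matmul_naive(A, X):
    n = len(A)
    Y = [[0.0] * n for _ in range(n)]
    ops = 0
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * X[k][j]
                ops += 1
            Y[i][j] = s
    return Y, ops

_, ops = matmul_naive([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(ops)  # 8 = 2**3 multiply-adds
```

On a sequential processor these n³ steps take O(n³) time; the dataflow designs above keep the same operation count but execute many of the steps concurrently in hardware.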

  17. Matrix multiplication implemented in data flow technology. Aleksandar Milinković, Belgrade University, School of Electrical Engineering, amilinko@gmail.com
