Efficient Matrix Multiplication Techniques for Real-to-Complex Field Conversion

Explore efficient matrix multiplication techniques for moving from the real to the complex field using the BLIS framework. Topics include the gemm algorithm, the BLIS micro-kernel, and the challenges of high-performance complex matrix multiplication. Collaboration details and acknowledgements are included.




Presentation Transcript


  1. BLIS Matrix Multiplication: from Real to Complex. Field G. Van Zee

  2. Acknowledgements
     Funding: NSF Award OCI-1148125 (SI2-SSI: A Linear Algebra Software Infrastructure for Sustained Innovation in Computational Chemistry and other Sciences; funded June 1, 2012 - May 31, 2015); other sources: Intel, Texas Instruments.
     Collaborators: Tyler Smith, Tze Meng Low

  3. Acknowledgements
     Journal papers:
     - BLIS: A Framework for Rapid Instantiation of BLAS Functionality (accepted to TOMS)
     - The BLIS Framework: Experiments in Portability (accepted to TOMS pending minor modifications)
     - Analytical Modeling is Enough for High-Performance BLIS (submitted to TOMS)
     Conference papers:
     - Anatomy of High-Performance Many-Threaded Matrix Multiplication (accepted to IPDPS 2014)

  4. Introduction
     Before we get started, let's review the general matrix-matrix multiplication (gemm) as implemented by Kazushige Goto in GotoBLAS. [Goto and van de Geijn 2008]

  5.-17. The gemm algorithm [diagram build across slides 5-17: C += A B is blocked in stages]
     - Partition C and B into column panels of width NC.
     - Partition A into column blocks and B into row panels of depth KC, yielding a sequence of rank-KC updates.
     - Pack the KC x NC row panel of B into contiguous micro-panels of width NR.
     - Partition C and A into row blocks of height MC.
     - Pack the MC x KC block of A into contiguous micro-panels of height MR.
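The blocking stages above can be sketched in plain C. This is a hypothetical scalar model, not BLIS code: packing is omitted, the block sizes are tiny illustrative values rather than cache-tuned ones, and the name `gemm_blocked` is invented for this sketch.

```c
#include <assert.h>

/* Tiny illustrative block sizes; real implementations choose NC, KC,
   MC, NR, MR to match the cache hierarchy and register file. */
enum { NC = 4, KC = 2, MC = 2, NR = 2, MR = 1 };

/* C (m x n) += A (m x k) * B (k x n), all row-major.
   Loop order mirrors the slide's partitioning: column panels of width
   NC, rank-KC slices, row blocks of height MC, then NR x MR tiles. */
static void gemm_blocked(int m, int n, int k,
                         const double *A, const double *B, double *C)
{
    for (int jc = 0; jc < n; jc += NC)            /* column panel of B, C */
        for (int pc = 0; pc < k; pc += KC)        /* rank-KC update */
            for (int ic = 0; ic < m; ic += MC)    /* block of A */
                for (int jr = jc; jr < jc + NC && jr < n; jr += NR)
                    for (int ir = ic; ir < ic + MC && ir < m; ir += MR)
                        /* "micro-kernel": KC rank-1 updates of an MR x NR tile */
                        for (int p = pc; p < pc + KC && p < k; p++)
                            for (int i = ir; i < ir + MR && i < m; i++)
                                for (int j = jr; j < jr + NR && j < n; j++)
                                    C[i*n + j] += A[i*k + p] * B[p*n + j];
}
```

Because each partitioning only reorders the scalar updates, the result matches an unblocked triple loop.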

  18.-22. Where the micro-kernel fits in [build across slides 18-22: the strides NR, MR are added to the loops]
     for ( 0 to NC-1 : NR )
       for ( 0 to MC-1 : MR )
         for ( 0 to KC-1 : 1 )
           // outer product
         endfor
       endfor
     endfor

  23.-24. The gemm micro-kernel [diagram: an MR x NR tile of C += (MR x KC) micro-panel of packed A times (KC x NR) micro-panel of packed B]
     Typical micro-kernel loop iteration (the KC loop, step 1):
     - Load a column of packed A
     - Load a row of packed B
     - Compute the outer product
     - Update C (kept in registers)
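The loop iteration above can be modeled in C. This is a hypothetical scalar sketch, not a real micro-kernel (which is written in assembly and keeps the accumulator in vector registers); the function name `microkernel` and the packed-operand layouts are illustrative assumptions.

```c
#include <assert.h>

enum { MR = 4, NR = 4 };

/* C (MR x NR) += packed A (MR x KC, stored column-by-column) times
   packed B (KC x NR, stored row-by-row). The body of the kc loop is
   one rank-1 update: a column of packed A times a row of packed B.
   The local c[][] plays the role of the register-resident micro-tile. */
static void microkernel(int kc, const double *a, const double *b,
                        double *C, int ldc)
{
    double c[MR][NR] = {{0}};             /* accumulator ("in registers") */
    for (int p = 0; p < kc; p++)          /* KC rank-1 updates */
        for (int i = 0; i < MR; i++)      /* column of packed A */
            for (int j = 0; j < NR; j++)  /* row of packed B */
                c[i][j] += a[p*MR + i] * b[p*NR + j];
    for (int i = 0; i < MR; i++)          /* finally, update C in memory */
        for (int j = 0; j < NR; j++)
            C[i*ldc + j] += c[i][j];
}
```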

  25.-26. From real to complex
     The HPC community focuses on the real domain. Why?
     - Prevalence of real-domain applications
     - Benchmarks
     - The complex domain has unique challenges

  27.-28. Challenges
     - Programmability
     - Floating-point latency / register set size
     - Instruction set

  29. Programmability
     What do you mean? The programmability of the BLIS micro-kernel: the micro-kernel typically must be implemented in assembly language. Ugh. Why assembly?
     - Compilers have trouble using vector instructions efficiently
     - Even using vector intrinsics tends to leave flops on the table

  30. Programmability
     Okay fine, I'll write my micro-kernel in assembly. It can't be that bad, right? I could show you actual assembly code, but this is supposed to be a retreat! Diagrams are more illustrative anyway.

  31. Programmability
     Diagrams will depict a rank-1 update. Why? It's the body of the micro-kernel's loop! The instruction set shown is similar to the Xeon Phi's.
     Notation: α, β, γ are elements of matrices A, B, C, respectively.
     Let's begin with the real domain.

  32. Real rank-1 update in assembly
     - 4 elements per vector register; implements a 4 x 4 rank-1 update
     - α(0:3), β(0:3) are real elements
     - Load/swizzle instructions required: LOAD, BROADCAST
     - Floating-point instructions required: MULTIPLY, ADD
     [diagram: LOAD a column of α's, BROADCAST each β across a register, MULTIPLY, then ADD into the 4 x 4 micro-tile of γ's]
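The real rank-1 update can be emulated with 4-wide "registers" modeled as structs. This is purely illustrative: `vec4`, `vload`, `vbcast`, and `vfma` are invented stand-ins for the slide's LOAD, BROADCAST, MULTIPLY, and ADD, not actual intrinsics.

```c
#include <assert.h>

typedef struct { double v[4]; } vec4;     /* a 4-wide vector register */

static vec4 vload(const double *p)        /* LOAD: 4 contiguous elements */
{ vec4 r; for (int i = 0; i < 4; i++) r.v[i] = p[i]; return r; }

static vec4 vbcast(double x)              /* BROADCAST: x into all lanes */
{ vec4 r; for (int i = 0; i < 4; i++) r.v[i] = x; return r; }

static vec4 vfma(vec4 acc, vec4 a, vec4 b) /* MULTIPLY then ADD */
{ for (int i = 0; i < 4; i++) acc.v[i] += a.v[i] * b.v[i]; return acc; }

/* One 4 x 4 real rank-1 update: γ(0:3, 0:3) += α(0:3) * β(0:3)^T.
   c[j] holds column j of the micro-tile. */
static void rank1_real(const double *alpha, const double *beta, vec4 c[4])
{
    vec4 a = vload(alpha);                /* column of packed A */
    for (int j = 0; j < 4; j++)
        c[j] = vfma(c[j], a, vbcast(beta[j]));
}
```

Note that real arithmetic needs only one swizzle (BROADCAST) per β element, which is the baseline the complex case is measured against.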

  33. Complex rank-1 update in assembly
     - 4 elements per vector register; implements a 2 x 2 rank-1 update
     - α0+iα1, α2+iα3, β0+iβ1, β2+iβ3 are complex elements
     - Load/swizzle instructions required: LOAD, DUPLICATE, SHUFFLE (within lanes), PERMUTE (across lanes)
     - Floating-point instructions required: MULTIPLY, ADD, SUBADD
     - High values in the micro-tile still need to be swapped (after the loop)
     [diagram: DUPLICATE, SHUFFLE, and PERMUTE rearrange the α's and β's, MULTIPLY, then SUBADD and ADD accumulate into the γ micro-tile]
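The complex rank-1 update can likewise be emulated with 4-wide "registers" holding two complex numbers as interleaved (real, imag) pairs. Again a hypothetical scalar model: `shuf`, `bcast_part`, and `subadd` mimic the slide's SHUFFLE, DUPLICATE/PERMUTE, and SUBADD, and the exact lane assignments are simplified relative to the actual diagram.

```c
#include <assert.h>

typedef struct { double v[4]; } vreg;  /* two complex numbers: (re, im, re, im) */

static vreg shuf(vreg a)               /* SHUFFLE: swap re/im within each pair */
{ vreg r = { { a.v[1], a.v[0], a.v[3], a.v[2] } }; return r; }

/* DUPLICATE/PERMUTE stand-in: spread the real (part=0) or imaginary
   (part=1) component of b's j-th complex element into every lane. */
static vreg bcast_part(vreg b, int j, int part)
{ vreg r; for (int i = 0; i < 4; i++) r.v[i] = b.v[2*j + part]; return r; }

static vreg mul(vreg a, vreg b)        /* MULTIPLY, lane-wise */
{ vreg r; for (int i = 0; i < 4; i++) r.v[i] = a.v[i] * b.v[i]; return r; }

static vreg subadd(vreg x, vreg y)     /* SUBADD: even lanes x-y, odd lanes x+y */
{ vreg r; for (int i = 0; i < 4; i++)
      r.v[i] = (i & 1) ? x.v[i] + y.v[i] : x.v[i] - y.v[i];
  return r; }

/* 2 x 2 complex rank-1 update: column j of the micro-tile c gets
   a * b_j, where a holds two complex elements of A and b two of B.
   Real part: ar*br - ai*bi; imag part: ai*br + ar*bi. */
static void rank1_complex(vreg a, vreg b, vreg c[2])
{
    for (int j = 0; j < 2; j++) {
        vreg t = subadd(mul(a,       bcast_part(b, j, 0)),
                        mul(shuf(a), bcast_part(b, j, 1)));
        for (int i = 0; i < 4; i++) c[j].v[i] += t.v[i];
    }
}
```

Even in this simplified model, each column of the micro-tile needs a SHUFFLE, two broadcasts, two MULTIPLYs, and a SUBADD, versus one BROADCAST and one MULTIPLY/ADD in the real case.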

  34. Programmability
     Bottom line: expressing complex arithmetic in assembly is
     - Awkward (at best)
     - Tedious (and potentially error-prone)
     - Possibly not even possible if instructions are missing! Or possible, but at a lower performance (flop rate)
     Assembly-coding the real domain isn't looking so bad now, is it?

  35. Challenges
     - Programmability
     - Floating-point latency / register set size
     - Instruction set

  36. Latency / register set size
     A complex rank-1 update needs extra registers to hold the intermediate results of the swizzle instructions. But that's okay! I can just reduce MR x NR (the micro-tile footprint), because complex does four times as many flops! Not quite: four times the flops on twice the data. Hrrrumph. Okay fine, twice as many flops per byte.

  37. Latency / register set size
     Actually, this two-fold flops-per-byte advantage for complex buys you nothing. Wait, what? Why?

  38.-45. Latency / register set size [build across slides 38-45]
     What happened to my extra flops!? They're still there, but there is a problem:
       γ_r + iγ_i += (α_r β_r − α_i β_i) + i(α_r β_i + α_i β_r)
     - Each element of γ must be updated TWICE
     - A complex rank-1 update = TWO real rank-1 updates
     - Each update of γ still requires a full latency period
     - Yes, we get to execute twice as many flops, but we are forced to spend twice as long executing them!
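The claim that one complex rank-1 update amounts to two real rank-1 updates can be made concrete with a scalar sketch. This assumes split real/imaginary storage and an invented helper name; it is a model of the data flow, not BLIS code.

```c
#include <assert.h>

/* C (m x n, split into cr/ci) += a * b^T for complex vectors a, b.
   The first pass and the second pass each touch every element of C,
   so each element sits on the FMA latency chain twice: exactly the
   cost of two real rank-1 updates. */
static void crank1_as_two_real(int m, int n,
                               const double *ar, const double *ai,
                               const double *br, const double *bi,
                               double *cr, double *ci)
{
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            /* first real rank-1 update */
            cr[i*n + j] += ar[i] * br[j];
            ci[i*n + j] += ai[i] * br[j];
            /* second real rank-1 update: the SAME elements, updated again */
            cr[i*n + j] -= ai[i] * bi[j];
            ci[i*n + j] += ar[i] * bi[j];
        }
}
```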

  46.-47. Latency / register set size
     So I have to keep MR x NR the same? Probably, yes (in bytes).
     And I still have to find registers to swizzle? Yes.
     RvdG: "This is why I like to live my life as a double."

  48. Challenges
     - Programmability
     - Floating-point latency / register set size
     - Instruction set

  49. Instruction set
     We need more sophisticated swizzle instructions:
     - DUPLICATE (in pairs)
     - SHUFFLE (within lanes)
     - PERMUTE (across lanes)
     And floating-point instructions:
     - SUBADD (subtract/add every other element)

  50. Instruction set
     The number of operands addressed by the instruction set also matters. Three is better than two (AVX's three-operand form vs. SSE's two-operand form). Why?
     - A two-operand MULTIPLY must overwrite one input operand
     - What if you need to reuse that operand? You have to make a copy
     - Copying increases the effective latency of the floating-point instruction
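The copy penalty can be shown in scalar C that mimics register behavior. An illustrative model only: the function `two_operand_reuse` is invented for this sketch, and real cost shows up as an extra MOVE on the dependency chain, not as C-level variables.

```c
#include <assert.h>

/* Computes a*b + a*c in a two-operand style: the destructive
   "a *= b" destroys a, so reusing the original a forces a copy. */
static double two_operand_reuse(double a, double b, double c)
{
    double a_copy = a;   /* extra MOVE: preserve a before it is overwritten */
    a *= b;              /* two-operand MULTIPLY: a now holds a*b */
    a_copy *= c;         /* reuse the original value via the copy */
    return a + a_copy;
}
```

With a three-operand form (dst = src1 * src2), both products could read `a` directly and the copy disappears.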
