Current ML-Related C++ Proposals on Matrix Extensions


A proposal for extending C++ with linear algebra functions, matrices, and operators that align better with tensor hardware such as Intel AMX, DPAS, and Nvidia Tensor Cores. The extension provides matrix operations including load, store, and primitive compute operations such as multiply-add (MAD). It also introduces a new matrix datatype with defined type, size, and layout policies, separating memory operations and layouts from the computation.

  • C++
  • Linear algebra
  • Matrix extension
  • ML proposals
  • Tensor hardware


Presentation Transcript


  1. Matrix in C++
     Dounia Khaldi, Intel Corp.
     04/14/2022

  2. Current ML-Related C++ Proposals
     - P2553 mdspan (C++23): other than indexing and slicing, no operations are associated with it.
     - P1684 mdarray, an owning multidimensional array analog of mdspan (C++26).
     - P1673: BLAS linear algebra functions (C++26).
     - P1385: linear algebra matrices and operators (deferred). Vector and matrix engines along with overloaded operators. From poll: define engine/matrix classes in terms of mdspan + storage and mdspan concepts (e.g. extents), and expose an mdspan-esque interface. This implies that fs_ and dyn_ are combined into one template parameterized on extents (which are either static or dynamic).
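To make the mdspan point concrete, here is a hand-rolled, non-owning 2-D view in the same spirit: it layers indexing and slicing over memory it does not own and, as the slide notes, carries no arithmetic operations of its own. This is purely an illustrative analogy with made-up names (`view2d`, `submatrix`), not the standard `std::mdspan` interface.

```cpp
#include <cassert>
#include <cstddef>

// Non-owning strided 2-D view: indexing and slicing only, no arithmetic.
template <typename T>
struct view2d {
    T* data;
    std::size_t rows, cols, stride;  // stride = elements between row starts

    // Element indexing into the viewed (not owned) storage.
    T& operator()(std::size_t i, std::size_t j) const { return data[i * stride + j]; }

    // "Slicing": a sub-view that shares the same underlying storage.
    view2d submatrix(std::size_t i0, std::size_t j0,
                     std::size_t r, std::size_t c) const {
        return {data + i0 * stride + j0, r, c, stride};
    }
};
```

Because the view is non-owning, writes through a sub-view are visible through the parent view and the original buffer alike.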

  3. Motivation
     - Proliferation of tensor hardware: Intel AMX, DPAS, Nvidia Tensor Cores.
     - BLAS is higher level than what hardware usage needs.
     - BLAS excludes ML-specific functions such as convolutions and activation functions.
     - Load, store, and mma primitives are missing in current proposals.
     - Matrix can be seen as a 2-D version of mdarray (owning).
     - DPC++ joint_matrix matches mdspan/mdarray more closely.
     - Initial operations: load, store, mad.
     - Element indexing: T operator()(int i, int j)
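The "2-D owning matrix with element indexing and a mad primitive" idea can be sketched in plain C++. The names (`matrix2d`, `mad`) and signatures below are illustrative assumptions, not the proposal's wording; `mad` implements the multiply-add C += A * B on whole matrices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A small owning 2-D matrix ("2-D mdarray") with row-major storage.
template <typename T>
struct matrix2d {
    std::size_t rows, cols;
    std::vector<T> data;  // owning storage, zero-initialized

    matrix2d(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c) {}

    // Element indexing as listed on the slide: operator()(i, j).
    T&       operator()(std::size_t i, std::size_t j)       { return data[i * cols + j]; }
    const T& operator()(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// mad: C += A * B, accumulating in C's (possibly wider) element type.
template <typename TA, typename TC>
void mad(const matrix2d<TA>& A, const matrix2d<TA>& B, matrix2d<TC>& C) {
    for (std::size_t i = 0; i < A.rows; ++i)
        for (std::size_t j = 0; j < B.cols; ++j) {
            TC acc = C(i, j);
            for (std::size_t k = 0; k < A.cols; ++k)
                acc += static_cast<TC>(A(i, k)) * static_cast<TC>(B(k, j));
            C(i, j) = acc;
        }
}
```

Separating the accumulator type TC from the input type TA mirrors the int8-in / int32-out pattern the later AMX example relies on.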

  4. C++ Matrix Extension

     template <typename T,
               size_t Rows = std::dynamic_extent,
               size_t Cols = std::dynamic_extent,
               LayoutPolicy l = row_major>
     struct matrix;

     - New matrix datatype, defined with a specified type, size, and layout.
     - Explicit data transfers:

     void matrix_load(matrix<> dst, T *base, unsigned stride, LayoutPolicy l = row_major);
     void matrix_store(matrix<> src, T *base, unsigned stride, LayoutPolicy l = row_major);

     - Separate memory operations and layout from the compute.
     - Layouts: row major, column major, packed (VNNI).
     - Extensible to add custom layouts (symmetric, tiled).

     matrix<> matrix_mad(matrix<> A, matrix<> B, matrix<> C);

     - Extensible to add more operations.
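A plausible scalar reading of the load/store semantics: copy a rows x cols tile between a tile object and a larger row-major buffer whose rows are `stride` elements apart. This is an assumption-laden sketch (the stand-in type `tile` and the row-major-only behavior are mine), not the proposal's specification.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the proposed matrix<> type, fixed to row-major storage.
template <typename T>
struct tile {
    std::size_t rows, cols;
    std::vector<T> v;
    tile(std::size_t r, std::size_t c) : rows(r), cols(c), v(r * c) {}
    T& at(std::size_t i, std::size_t j) { return v[i * cols + j]; }
};

// matrix_load: base points at element (0,0) of the tile inside a larger
// buffer; consecutive tile rows are `stride` elements apart in the buffer.
template <typename T>
void matrix_load(tile<T>& dst, const T* base, unsigned stride) {
    for (std::size_t i = 0; i < dst.rows; ++i)
        for (std::size_t j = 0; j < dst.cols; ++j)
            dst.at(i, j) = base[i * stride + j];
}

// matrix_store: the inverse copy, tile back into the strided buffer.
template <typename T>
void matrix_store(tile<T>& src, T* base, unsigned stride) {
    for (std::size_t i = 0; i < src.rows; ++i)
        for (std::size_t j = 0; j < src.cols; ++j)
            base[i * stride + j] = src.at(i, j);
}
```

Keeping the stride as a separate argument is what lets the same tile type address any rectangular window of a larger matrix, which the tiled GEMM on the next slide depends on.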

  5. C++ Matrix Extension Example

     int8_t memA[M][K] = {0};
     int8_t memB[K][N] = {0};
     int32_t memC[M][N] = {0};

     // Assuming memB has already been VNNIed
     matrix<int8_t, tM, tK> tA;
     matrix<int8_t, tK, tN> tB;
     matrix<int32_t, tM, tN> tC;

     for (int i = 0; i < M; i += tM)
       for (int j = 0; j < N; j += tN) {
         matrix_load(tC, &memC[i][j], N, row_major);
         for (int k = 0; k < K; k += tK) {
           matrix_load(tA, &memA[i][k], K, row_major);
           matrix_load(tB, &memB[k][j], N, packed_b);
           tC = matrix_mad(tA, tB, tC);
         }
         matrix_store(tC, &memC[i][j], N, row_major);
       }

     For comparison, the equivalent MKL call:

     cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                 m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
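The tiled loop above can be exercised as a runnable scalar stand-in (no VNNI packing, plain row-major everywhere, with the tile loads/stores and matrix_mad inlined). The sizes below are hypothetical and chosen so the tile sizes divide the matrix sizes evenly.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical problem and tile sizes; tM | M, tN | N, tK | K.
constexpr int M = 4, N = 4, K = 4;
constexpr int tM = 2, tN = 2, tK = 2;

// C += A * B, computed one tM x tN tile of C at a time, tK columns of A
// (rows of B) per inner step, mirroring the slide's loop structure.
void tiled_gemm(const int8_t* A, const int8_t* B, int32_t* C) {
    for (int i = 0; i < M; i += tM)
        for (int j = 0; j < N; j += tN)
            for (int k = 0; k < K; k += tK)
                // Inlined "matrix_mad" on the C tile at (i, j).
                for (int ii = 0; ii < tM; ++ii)
                    for (int jj = 0; jj < tN; ++jj) {
                        int32_t acc = C[(i + ii) * N + (j + jj)];
                        for (int kk = 0; kk < tK; ++kk)
                            acc += int32_t(A[(i + ii) * K + (k + kk)]) *
                                   int32_t(B[(k + kk) * N + (j + jj)]);
                        C[(i + ii) * N + (j + jj)] = acc;
                    }
}
```

Because addition is associative over the k tiles, the result is identical to an untiled int8-in / int32-accumulate GEMM; the tiling only changes the order in which products are accumulated.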

  6. Mapping the C++ Matrix Interface to AMX Intrinsics

     C++ using the matrix API:

     int8_t memA[M][K] = {0};
     int8_t memB[K][N] = {0};
     int32_t memC[M][N] = {0};
     // Assuming memB has already been VNNIed
     matrix<int8_t> tA(tM, tK);
     matrix<int8_t> tB(tK, tN);
     matrix<int32_t> tC(tM, tN);

     for (int i = 0; i < M; i += tM)
       for (int j = 0; j < N; j += tN) {
         matrix_load(tC, &memC[i][j], N, row_major);
         for (int k = 0; k < K; k += tK) {
           matrix_load(tA, &memA[i][k], K, row_major);
           matrix_load(tB, &memB[k][j], N, word_packed);
           tC = matrix_mad(tA, tB, tC);
         }
         matrix_store(tC, &memC[i][j], N, row_major);
       }

     Plain C using AMX intrinsics:

     int8_t memA[M][K] = {0};
     int8_t memB[K][N] = {0};
     int32_t memC[M][N] = {0};
     // Assuming memB has already been VNNIed
     // Assuming tM < 32, tN < 32, tK < 64
     __tile1024i A = {tM, tK};
     __tile1024i B = {tK, tN};
     __tile1024i C = {tM, tN};

     for (int i = 0; i < M; i += tM)
       for (int j = 0; j < N; j += tN) {
         __tile_loadd(&C, &memC[i][j], N);
         for (int k = 0; k < K; k += tK) {
           __tile_loadd(&A, &memA[i][k], K);
           __tile_loadd(&B, &memB[k][j], N);
           __tile_dpbsud(&C, A, B);
         }
         __tile_stored(&memC[i][j], N, C);
       }
