
Recent Communication Optimizations in Charm++: Advancing Messaging Efficiency
"Explore recent advancements in communication optimizations within Charm++, including zero-copy entry methods, RDMA usage, and SHM transport benefits for improving performance and reducing memory footprint."
Presentation Transcript
Recent Communication Optimizations in Charm++
Nitin Bhat, Software Engineer, Charmworks, Inc.
16th Annual Charm Workshop 2018

Agenda
- Existing Charm++ Messaging API
- Motivation
- Zero-copy Entry Method Send API using RDMA
- Zero-copy Direct API using RDMA
- Results
- Using SHM transport using CMA
- Results
- Summary
forcecalculations.ci (Charm Interface File - Declarations)

    module forcecalculations {
      ...
      array [1D] Cell {
        entry forces();
        entry void recv_forces(double forces[size], int size, double value);
      }
      .....
    }

forcecalculations.C (C++ Code File - Entry method)

    void recv_forces(double *forces, int size, double value) {
      .
    }

forcecalculations.C (C++ Code File - Call site)

    Cell_Proxy[n].recv_forces(forces, 1000000, 4.0);
Regular Messaging API - What happens under the hood?
[Diagram: on Node 0, the call Cell_Proxy[n].recv_forces(forces, size, value) marshals the parameters into a single message (header, size, value, forces) that Charm++ sends over the network. On Node 1, Charm++ un-marshals the parameters and invokes recv_forces(double *forces, int size, double value). Diagram labels: Marshalling of Parameters, Un-marshalling of Parameters, Header, size, value, forces, RGET, metadata, Network.]
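The copy cost of the regular path can be illustrated with a small stand-alone C++ sketch. It does not use Charm++. `Header`, `marshal`, and `unmarshal` are hypothetical names, and the message layout is an assumption made only for illustration; the real Charm++ envelope differs. The point is that the large array is copied into the message on the sender and copied out again on the receiver.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical message header; the real Charm++ envelope is different.
struct Header {
    int size;      // number of doubles in the payload
    double value;  // small scalar parameter
};

// Sender side: copy the header fields and the whole array into one message.
std::vector<char> marshal(const double* forces, int size, double value) {
    assert(forces != nullptr && size > 0);
    Header h{size, value};
    std::vector<char> msg(sizeof(Header) + size * sizeof(double));
    std::memcpy(msg.data(), &h, sizeof(Header));
    // This is the large copy that the zero-copy APIs aim to avoid.
    std::memcpy(msg.data() + sizeof(Header), forces, size * sizeof(double));
    return msg;
}

// Receiver side: recover the parameters from the message bytes.
Header unmarshal(const std::vector<char>& msg, std::vector<double>& forces_out) {
    assert(msg.size() >= sizeof(Header));
    Header h;
    std::memcpy(&h, msg.data(), sizeof(Header));
    forces_out.resize(h.size);
    std::memcpy(forces_out.data(), msg.data() + sizeof(Header),
                h.size * sizeof(double));
    return h;
}
```

In this model the message size grows with the array, so a 1,000,000-element array costs an 8 MB allocation and copy on the sending side before anything reaches the network.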
Motivation
- Memory system is the bottleneck
  - Faster cores and fatter nodes
  - Processor performance has been scaling much better than memory performance over the years
- On RDMA/CMA-enabled systems, avoid copies of the large buffer with minor changes in the application logic
- Advantages:
  - Reduce memory footprint
  - Improve performance by reducing memory allocation size and avoiding the copy
  - Reduce page faults and data cache misses
forcecalculations.ci (Charm Interface File - Declarations)

    module forcecalculations {
      ...
      array [1D] Cell {
        entry forces();
        entry void recv_forces(nocopy double forces[size], int size, double value);
      }
      .....
    }

forcecalculations.C (C++ Code File - Entry method)

    void recv_forces(double *forces, int size, double value) {
      .
    }

forcecalculations.C (C++ Code File - Call site)

    CkCallback cb(CkIndex_Cell::completed(NULL), cellArrayID);
    Cell_Proxy[n].recv_forces(CkSendBuffer(forces, cb), 1000000, 4.0);
Zero-copy Entry Method Send API - What happens under the hood?
[Diagram: on Node 0, the call Cell_Proxy[n].recv_forces(CkSendBuffer(forces, cb), size, value) marshals only the header, size, and value into the message sent over the network; the forces buffer is not copied into it. Node 1 fetches forces from Node 0 with an RGET, un-marshals the parameters, and invokes recv_forces(double *forces, int size, double value). A Callback is shown on the sender side. Diagram labels: Marshalling of Parameters, Un-marshalling of Parameters, Header, size, value, forces, RGET, Network.]
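The flow on this slide can be modeled in plain C++ as a rough sketch. Nothing here is Charm++ code: `BufferInfo`, `SmallMessage`, `simulated_rget`, and `receive` are hypothetical names, and a `memcpy` stands in for the one-sided RDMA get. The sketch assumes the message carries only scalars plus a description of the sender's buffer, and that the sender's callback fires once the receiver has pulled the data.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical stand-in for the metadata that travels instead of the array.
struct BufferInfo {
    const double* ptr;              // address of the sender's buffer
    std::size_t bytes;              // length of the buffer in bytes
    std::function<void()> on_done;  // sender-side completion callback
};

// Small message: scalars plus buffer metadata, but no copy of the array.
struct SmallMessage {
    int size;
    double value;
    BufferInfo forces;
};

// Stand-in for an RDMA get: the receiver pulls the data, then the sender
// is told that its buffer may be reused.
void simulated_rget(const BufferInfo& src, double* dest) {
    std::memcpy(dest, src.ptr, src.bytes);  // models the one-sided transfer
    if (src.on_done) src.on_done();
}

// Receiver side: allocate space, pull the array, and hand it to the caller.
std::vector<double> receive(const SmallMessage& msg) {
    assert(msg.size > 0);
    assert(msg.forces.bytes == msg.size * sizeof(double));
    std::vector<double> forces(msg.size);
    simulated_rget(msg.forces, forces.data());
    return forces;
}
```

Compared with the regular path, the sender no longer allocates or fills a message-sized buffer; it must only keep `forces` untouched until its callback runs.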
forcecalculations.ci (Charm Interface File - Declarations)

    module forcecalculations {
      ...
      array [1D] Cell {
        entry forces();
        entry void recv_forces(CkNcpySource src, int size, double value);
      }
      .....
    }

forcecalculations.C (C++ Code File - Entry method)

    void recv_forces(CkNcpySource src, int size, double value) {
      CkCallback recv_cb(CkIndex_Cell::recv_completed(NULL), cellArrayID);
      CkNcpyDestination dest(myForces, size*sizeof(double), recv_cb, CK_BUFFER_REG);
      dest.rget(src);
    }

forcecalculations.C (C++ Code File - Call site)

    CkCallback send_cb(CkIndex_Cell::send_completed(NULL), cellArrayID);
    CkNcpySource src(forces, size*sizeof(double), send_cb, CK_BUFFER_REG);
    Cell_Proxy[n].recv_forces(src, 1000000, 4.0);
Zero-copy Direct API - What happens under the hood?
[Diagram: the sender creates CkNcpySource src(forces, size*sizeof(double), send_cb) and calls Cell_Proxy[n].recv_forces(src, size, value); the marshalled message (header, src, size, value) travels over the network between Node 0 and Node 1. On the receiver, the parameters are un-marshalled and recv_forces(CkNcpySource src, int size, double value) calls dest.rget(src), which transfers forces into myforces with an RGET. A Sender Callback and a Receiver Callback are shown. Diagram labels: Marshalling of Parameters, Un-marshalling of Parameters, Header, src, size, value, forces, myforces, RGET, Network.]
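As a rough model of the Direct API flow, the following plain C++ sketch shows a receiver pulling data straight into a buffer it owns, with one callback per side. `Source` and `Destination` here are hypothetical structs written for this illustration and are not the Charm++ `CkNcpySource` and `CkNcpyDestination` classes; a `memcpy` stands in for the RDMA get, and the size check is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical model of a source handle: it describes the sender's buffer.
struct Source {
    const void* ptr;
    std::size_t bytes;
    std::function<void()> on_sent;  // runs when the sender may reuse the buffer
};

// Hypothetical model of a destination handle over a user-owned buffer.
struct Destination {
    void* ptr;
    std::size_t bytes;
    std::function<void()> on_received;  // runs when the data has arrived

    // Stand-in for a one-sided get; returns false if the source does not fit.
    bool rget(const Source& src) {
        if (src.bytes > bytes) return false;
        std::memcpy(ptr, src.ptr, src.bytes);  // models the RDMA transfer
        if (src.on_sent) src.on_sent();
        if (on_received) on_received();
        return true;
    }
};
```

Because the destination buffer belongs to the application, no intermediate receive buffer is allocated in this model, which is the difference from the entry method send variant.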
Modes of Operation in Direct API to support memory registration (gni, verbs, ofi)
- CK_BUFFER_UNREG - Default mode
  - Unregistered at the beginning
  - Delayed registration if required
- CK_BUFFER_REG
  - Registered by the API
- CK_BUFFER_PREREG
  - Registered before the API call by allocating memory out of a pre-registered mempool
- CK_BUFFER_NOREG
  - No registration
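The four modes differ mainly in when registration happens relative to the API call. The sketch below encodes that reading of the slide in plain C++. `RegMode`, `RegTime`, and `registration_time` are hypothetical names introduced for illustration and are not part of Charm++.

```cpp
#include <cassert>

// Hypothetical mirror of the four modes named on the slide.
enum class RegMode { Unreg, Reg, Prereg, Noreg };

// When, relative to the API call, the buffer is registered in this model.
enum class RegTime { BeforeCall, AtCall, LazilyIfNeeded, Never };

RegTime registration_time(RegMode mode) {
    switch (mode) {
        case RegMode::Unreg:  return RegTime::LazilyIfNeeded;  // default: delayed
        case RegMode::Reg:    return RegTime::AtCall;          // the API registers
        case RegMode::Prereg: return RegTime::BeforeCall;      // pre-registered mempool
        case RegMode::Noreg:  return RegTime::Never;           // no registration
    }
    return RegTime::Never;  // not reached for valid enumerators
}
```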
Results
- Pingpong: Regular API vs. Zerocopy Entry Method Send API
- Pingpong: Regular Send and Receive API vs. Zerocopy Direct API
Results on BG/Q (Vesta), PAMI interconnect (up to 1.6x)

| Message Size | Regular Send API (us) | Zerocopy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zerocopy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup |
|---|---|---|---|---|---|---|---|---|
| 2 KB | 34.10 | 67.85 | -98.97 | 0.50 | 35.57 | 47.82 | -34.44 | 0.74 |
| 4 KB | 37.00 | 68.02 | -83.86 | 0.54 | 38.27 | 49.02 | -28.10 | 0.78 |
| 8 KB | 40.28 | 70.31 | -74.55 | 0.57 | 42.03 | 51.43 | -22.36 | 0.82 |
| 16 KB | 46.58 | 74.12 | -59.14 | 0.63 | 48.74 | 56.07 | -15.04 | 0.87 |
| 32 KB | 57.11 | 83.43 | -46.08 | 0.68 | 61.49 | 64.66 | -5.15 | 0.95 |
| 64 KB | 78.80 | 101.76 | -29.14 | 0.77 | 86.15 | 83.48 | 3.10 | 1.03 |
| 128 KB | 122.08 | 138.55 | -13.49 | 0.88 | 135.77 | 121.14 | 10.78 | 1.12 |
| 256 KB | 208.94 | 212.58 | -1.74 | 0.98 | 235.53 | 195.19 | 17.13 | 1.21 |
| 512 KB | 381.90 | 359.59 | 5.84 | 1.06 | 434.52 | 341.57 | 21.39 | 1.27 |
| 1 MB | 728.81 | 655.91 | 10.00 | 1.11 | 831.49 | 636.78 | 23.42 | 1.31 |
| 2 MB | 1484.07 | 1245.52 | 16.07 | 1.19 | 1755.24 | 1228.63 | 30.00 | 1.43 |
| 4 MB | 3307.57 | 2676.49 | 19.08 | 1.24 | 3718.02 | 2407.34 | 35.25 | 1.54 |
| 8 MB | 6569.11 | 5282.12 | 19.59 | 1.24 | 7465.67 | 4767.12 | 36.15 | 1.57 |
| 16 MB | 13771.92 | 10565.15 | 23.28 | 1.30 | 15539.09 | 9560.51 | 38.47 | 1.63 |
| 32 MB | 29246.51 | 23730.00 | 18.86 | 1.23 | 33700.23 | 21573.57 | 35.98 | 1.56 |
| 64 MB | 57096.48 | 44976.32 | 21.23 | 1.27 | 65988.34 | 40644.15 | 38.41 | 1.62 |
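The slides do not state how the speedup and % improvement columns are derived, but the values are consistent with the two formulas below. This C++ sketch treats those formulas as an assumption inferred from the data; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Speedup of the zero-copy variant over the regular one (>1 means faster).
double speedup(double regular_us, double zerocopy_us) {
    assert(zerocopy_us > 0.0);
    return regular_us / zerocopy_us;
}

// Percentage improvement relative to the regular time (negative means slower).
double percent_improvement(double regular_us, double zerocopy_us) {
    assert(regular_us > 0.0);
    return (regular_us - zerocopy_us) / regular_us * 100.0;
}
```

For the 64 MB entry of the BG/Q results, `speedup(57096.48, 44976.32)` rounds to 1.27 and `percent_improvement(57096.48, 44976.32)` rounds to 21.23, matching the reported values. A speedup below 1 and a negative improvement describe the same slowdown, which is why small messages show both.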
Results on Dell/Intel cluster (Golub), Infiniband interconnect (up to 4.3x)

| Message Size | Regular Send API (us) | Zerocopy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zerocopy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup |
|---|---|---|---|---|---|---|---|---|
| 2 KB | 4.15 | 114.34 | -2655.45 | 0.04 | 3.99 | 6.15 | -54.26 | 0.65 |
| 4 KB | 4.98 | 115.56 | -2220.20 | 0.04 | 4.77 | 6.38 | -33.81 | 0.75 |
| 8 KB | 6.52 | 115.48 | -1671.76 | 0.06 | 6.15 | 7.32 | -18.96 | 0.84 |
| 16 KB | 9.30 | 120.05 | -1190.74 | 0.08 | 8.92 | 8.95 | -0.30 | 1.00 |
| 32 KB | 15.64 | 124.64 | -697.08 | 0.13 | 15.76 | 12.04 | 23.61 | 1.31 |
| 64 KB | 24.67 | 133.03 | -439.24 | 0.19 | 27.82 | 18.13 | 34.85 | 1.53 |
| 128 KB | 43.53 | 150.97 | -246.78 | 0.29 | 53.55 | 30.20 | 43.61 | 1.77 |
| 256 KB | 81.57 | 179.27 | -119.77 | 0.46 | 103.41 | 54.65 | 47.15 | 1.89 |
| 512 KB | 159.56 | 244.22 | -53.06 | 0.65 | 202.44 | 103.98 | 48.64 | 1.95 |
| 1 MB | 397.62 | 421.63 | -6.04 | 0.94 | 528.40 | 201.00 | 61.96 | 2.63 |
| 2 MB | 760.64 | 726.41 | 4.50 | 1.05 | 970.92 | 396.63 | 59.15 | 2.45 |
| 4 MB | 1456.88 | 1348.92 | 7.41 | 1.08 | 1878.60 | 794.44 | 57.71 | 2.36 |
| 8 MB | 6428.19 | 3835.77 | 40.33 | 1.68 | 7154.38 | 1658.74 | 76.82 | 4.31 |
| 16 MB | 13891.67 | 6287.78 | 54.74 | 2.21 | 15631.23 | 3305.39 | 78.85 | 4.73 |
| 32 MB | 24835.79 | 17905.08 | 27.91 | 1.39 | 28174.30 | 6654.32 | 76.38 | 4.23 |
| 64 MB | 50290.13 | 35370.92 | 29.67 | 1.42 | 56955.59 | 13259.62 | 76.72 | 4.30 |
Results on Cray XC (Edison), GNI interconnect (up to 8.7x)

| Message Size | Regular Send API (us) | Zerocopy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zerocopy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup |
|---|---|---|---|---|---|---|---|---|
| 2 KB | 4.38 | 589.26 | -13345.67 | 0.01 | 4.50 | 5.91 | -31.38 | 0.76 |
| 4 KB | 5.54 | 576.78 | -10314.84 | 0.01 | 5.51 | 5.90 | -7.04 | 0.93 |
| 8 KB | 5.86 | 560.79 | -9475.61 | 0.01 | 5.99 | 6.71 | -12.03 | 0.89 |
| 16 KB | 6.76 | 587.23 | -8585.87 | 0.01 | 7.82 | 7.70 | 1.56 | 1.02 |
| 32 KB | 10.95 | 568.08 | -5086.38 | 0.02 | 13.16 | 9.43 | 28.33 | 1.40 |
| 64 KB | 15.72 | 602.76 | -3734.09 | 0.03 | 26.59 | 13.28 | 50.06 | 2.00 |
| 128 KB | 30.08 | 626.32 | -1981.94 | 0.05 | 49.09 | 21.20 | 56.82 | 2.32 |
| 256 KB | 56.59 | 649.52 | -1047.70 | 0.09 | 95.08 | 36.40 | 61.71 | 2.61 |
| 512 KB | 108.59 | 698.80 | -543.50 | 0.16 | 205.05 | 67.68 | 66.99 | 3.03 |
| 1 MB | 226.92 | 759.19 | -234.57 | 0.30 | 372.59 | 157.04 | 57.85 | 2.37 |
| 2 MB | 475.25 | 915.49 | -92.63 | 0.52 | 828.88 | 307.46 | 62.91 | 2.70 |
| 4 MB | 913.32 | 1523.03 | -66.76 | 0.60 | 1475.04 | 517.97 | 64.88 | 2.85 |
| 8 MB | 1773.81 | 2738.94 | -54.41 | 0.65 | 3342.99 | 1025.65 | 69.32 | 3.26 |
| 16 MB | 14835.30 | 7263.10 | 51.04 | 2.04 | 18455.18 | 2245.69 | 87.83 | 8.22 |
| 32 MB | 26601.09 | 16218.50 | 39.03 | 1.64 | 38212.89 | 4589.43 | 87.99 | 8.33 |
| 64 MB | 52790.98 | 29718.78 | 43.70 | 1.78 | 81922.82 | 9388.09 | 88.54 | 8.73 |
Results on Intel KNL cluster (Stampede2), Intel Omni-Path interconnect (up to 10x)

| Message Size | Regular Send API (us) | Zerocopy EM Send API (us) | ZC EM API % Improvement | ZC EM Speedup | Regular Send and Receive API (us) | Zerocopy Direct API (GET) (us) | Direct API % Improvement | Direct API Speedup |
|---|---|---|---|---|---|---|---|---|
| 2 KB | 16.79 | 55.91 | -232.96 | 0.30 | 16.96 | 36.18 | -113.31 | 0.47 |
| 4 KB | 18.06 | 59.45 | -229.14 | 0.30 | 18.61 | 37.95 | -103.94 | 0.49 |
| 8 KB | 21.23 | 68.65 | -223.40 | 0.31 | 21.46 | 42.05 | -95.95 | 0.51 |
| 16 KB | 24.69 | 74.33 | -201.06 | 0.33 | 25.80 | 46.19 | -79.06 | 0.56 |
| 32 KB | 30.39 | 75.55 | -148.57 | 0.40 | 33.41 | 49.97 | -49.58 | 0.67 |
| 64 KB | 137.88 | 147.84 | -7.22 | 0.93 | 154.33 | 57.28 | 62.89 | 2.69 |
| 128 KB | 179.06 | 205.41 | -14.72 | 0.87 | 191.62 | 155.07 | 19.07 | 1.24 |
| 256 KB | 215.92 | 319.49 | -47.97 | 0.68 | 195.90 | 162.64 | 16.97 | 1.20 |
| 512 KB | 207.66 | 336.76 | -62.17 | 0.62 | 323.97 | 154.20 | 52.40 | 2.10 |
| 1 MB | 407.83 | 342.27 | 16.08 | 1.19 | 605.58 | 194.84 | 67.83 | 3.11 |
| 2 MB | 736.41 | 383.23 | 47.96 | 1.92 | 1060.35 | 248.68 | 76.55 | 4.26 |
| 4 MB | 1376.30 | 560.89 | 59.25 | 2.45 | 1901.06 | 453.40 | 76.15 | 4.19 |
| 8 MB | 2811.16 | 831.74 | 70.41 | 3.38 | 6805.73 | 781.65 | 88.51 | 8.71 |
| 16 MB | 6008.41 | 1531.04 | 74.52 | 3.92 | 16454.11 | 1498.92 | 90.89 | 10.98 |
| 32 MB | 23693.12 | 11775.96 | 50.30 | 2.01 | 29109.18 | 2888.36 | 90.08 | 10.08 |
| 64 MB | 45585.29 | 21727.71 | 52.34 | 2.10 | 55920.52 | 5666.87 | 89.87 | 9.87 |
Using SHM Transport using CMA
- By default, Charm++ within-node communication between processes uses the network
- SHM skips the network
- Cross Memory Attach (CMA), available since Linux 3.2
- The implementation uses a metadata message (sent through the network), followed by a process_vm_readv call and an ack message (sent through the network)
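The core of the CMA step is the Linux `process_vm_readv` system call, which copies memory directly from another process on the same host. The sketch below is a stand-alone, Linux-only illustration and not the Charm++ implementation: `cma_read` and `cma_demo` are hypothetical helpers, a pipe stands in for the metadata message, and killing the child stands in for the ack. It assumes the parent is permitted to read a child it forked; on systems that restrict the call, the demo reports that it could not run.

```cpp
#include <cassert>
#include <cstddef>
#include <signal.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

// Copies `bytes` bytes from `remote_addr` in process `pid` into `local`.
// Returns the number of bytes read, or -1 with errno set.
ssize_t cma_read(pid_t pid, void* local, const void* remote_addr, std::size_t bytes) {
    iovec local_iov{local, bytes};
    iovec remote_iov{const_cast<void*>(remote_addr), bytes};
    return process_vm_readv(pid, &local_iov, 1, &remote_iov, 1, 0);
}

// Demo: a child fills a buffer, then the parent pulls it with one CMA read.
// Returns 1 on a correct read, 0 on wrong data, and -1 if the demo could not
// run (pipe or fork failure, or CMA not permitted on this system).
int cma_demo() {
    static double buf[4] = {0, 0, 0, 0};  // same virtual address in both processes
    int ready[2];
    if (pipe(ready) != 0) return -1;
    pid_t child = fork();
    if (child < 0) {
        close(ready[0]);
        close(ready[1]);
        return -1;
    }
    if (child == 0) {                      // child plays the sender
        close(ready[0]);
        for (int i = 0; i < 4; ++i) buf[i] = i + 0.5;  // private copy after fork
        char go = 1;
        if (write(ready[1], &go, 1) != 1) _exit(1);    // stands in for the metadata message
        pause();                           // keep the buffer alive until killed
        _exit(0);
    }
    close(ready[1]);
    char go = 0;
    double got[4] = {0, 0, 0, 0};
    ssize_t n = -1;
    if (read(ready[0], &go, 1) == 1)       // wait for the "metadata"
        n = cma_read(child, got, buf, sizeof(buf));
    kill(child, SIGKILL);                  // stands in for the ack message
    waitpid(child, nullptr, 0);
    close(ready[0]);
    if (n != static_cast<ssize_t>(sizeof(buf))) return -1;
    for (int i = 0; i < 4; ++i)
        if (got[i] != i + 0.5) return 0;
    return 1;
}
```

The parent's own copy of `buf` stays zeroed, so a correct result can only come from the child's memory. The data moves with a single copy and no intermediate shared segment, while the small control messages still travel separately, as the slide describes.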
Results
- Pingpong: using the network vs. using SHM transport over CMA
Results on a lab machine with Ethernet network (up to 4x)

| Size | No CMA one-way time (us) | CMA one-way time (us) | % Improvement | Speedup |
|---|---|---|---|---|
| 1 KB | 5.58 | 10.02 | -79.54 | 0.56 |
| 2 KB | 5.93 | 10.19 | -71.97 | 0.58 |
| 4 KB | 6.27 | 10.36 | -65.25 | 0.61 |
| 8 KB | 7.56 | 10.96 | -45.00 | 0.69 |
| 16 KB | 11.55 | 11.93 | -3.32 | 0.97 |
| 32 KB | 19.87 | 14.22 | 28.42 | 1.40 |
| 64 KB | 36.31 | 18.91 | 47.93 | 1.92 |
| 128 KB | 66.57 | 27.68 | 58.42 | 2.40 |
| 256 KB | 130.52 | 44.50 | 65.91 | 2.93 |
| 512 KB | 254.47 | 75.09 | 70.49 | 3.39 |
| 1 MB | 500.50 | 133.47 | 73.33 | 3.75 |
| 2 MB | 1025.51 | 252.18 | 75.41 | 4.07 |
| 4 MB | 2321.18 | 687.42 | 70.38 | 3.38 |
| 8 MB | 4935.33 | 1850.31 | 62.51 | 2.67 |
| 16 MB | 9703.12 | 3641.47 | 62.47 | 2.66 |
| 32 MB | 21204.47 | 9358.97 | 55.86 | 2.27 |
Results on Edison (GNI) (up to 1.5x)

| Size | No CMA one-way time (us) | CMA one-way time (us) | % Improvement | Speedup |
|---|---|---|---|---|
| 256 Bytes | 1.39 | 2.35 | -68.78 | 0.59 |
| 512 Bytes | 1.43 | 2.41 | -68.68 | 0.59 |
| 1 KB | 3.56 | 2.33 | 34.74 | 1.53 |
| 2 KB | 3.46 | 2.49 | 28.09 | 1.39 |
| 4 KB | 3.69 | 2.74 | 25.58 | 1.34 |
| 8 KB | 4.10 | 3.41 | 16.84 | 1.20 |
| 16 KB | 5.16 | 4.37 | 15.40 | 1.18 |
| 32 KB | 7.23 | 6.17 | 14.64 | 1.17 |
| 64 KB | 11.41 | 10.17 | 10.86 | 1.12 |
| 128 KB | 19.80 | 18.06 | 8.77 | 1.10 |
| 256 KB | 36.71 | 33.83 | 7.84 | 1.09 |
| 512 KB | 70.28 | 116.89 | -66.33 | 0.60 |
| 1 MB | 137.14 | 267.55 | -95.08 | 0.51 |
| 2 MB | 270.96 | 528.58 | -95.08 | 0.51 |
| 4 MB | 561.46 | 1060.39 | -88.86 | 0.53 |
| 8 MB | 1208.64 | 2109.57 | -74.54 | 0.57 |
| 16 MB | 6156.18 | 6654.44 | -8.09 | 0.93 |
| 32 MB | 10463.20 | 12576.42 | -20.20 | 0.83 |
Results on Stampede2 (OFI) (up to 1.1x)

| Size | No CMA one-way time (us) | CMA one-way time (us) | % Improvement | Speedup |
|---|---|---|---|---|
| 1 KB | 6.05 | 15.02 | -148.14 | 0.40 |
| 2 KB | 6.44 | 15.62 | -142.47 | 0.41 |
| 4 KB | 9.34 | 16.01 | -71.51 | 0.58 |
| 8 KB | 10.12 | 17.28 | -70.82 | 0.59 |
| 16 KB | 18.83 | 19.63 | -4.24 | 0.96 |
| 32 KB | 23.81 | 24.27 | -1.93 | 0.98 |
| 64 KB | 40.18 | 35.81 | 10.89 | 1.12 |
| 128 KB | 55.79 | 52.16 | 6.51 | 1.07 |
| 256 KB | 86.60 | 76.35 | 11.84 | 1.13 |
| 512 KB | 190.23 | 166.52 | 12.46 | 1.14 |
| 1 MB | 353.27 | 336.50 | 4.75 | 1.05 |
| 2 MB | 619.59 | 621.30 | -0.28 | 1.00 |
| 4 MB | 1198.66 | 1187.12 | 0.96 | 1.01 |
| 8 MB | 2334.56 | 2358.88 | -1.04 | 0.99 |
| 16 MB | 4560.66 | 4639.19 | -1.72 | 0.98 |
| 32 MB | 18086.00 | 17088.52 | 5.52 | 1.06 |
Results on Bridges (OFI) (up to 1.15x)

| Size | No CMA one-way time (us) | CMA one-way time (us) | % Improvement | Speedup |
|---|---|---|---|---|
| 1 KB | 2.48 | 6.28 | -153.55 | 0.39 |
| 2 KB | 2.70 | 4.70 | -74.10 | 0.57 |
| 4 KB | 4.01 | 5.22 | -30.14 | 0.77 |
| 8 KB | 6.63 | 6.65 | -0.37 | 1.00 |
| 16 KB | 10.22 | 9.33 | 8.70 | 1.10 |
| 32 KB | 17.03 | 16.99 | 0.24 | 1.00 |
| 64 KB | 28.40 | 24.73 | 12.91 | 1.15 |
| 128 KB | 49.79 | 43.99 | 11.66 | 1.13 |
| 256 KB | 91.54 | 92.98 | -1.57 | 0.98 |
| 512 KB | 169.61 | 167.09 | 1.49 | 1.02 |
| 1 MB | 325.80 | 323.69 | 0.65 | 1.01 |
| 2 MB | 646.17 | 619.66 | 4.10 | 1.04 |
| 4 MB | 1293.16 | 1252.15 | 3.17 | 1.03 |
| 8 MB | 2556.80 | 2559.24 | -0.10 | 1.00 |
| 16 MB | 5148.79 | 5219.44 | -1.37 | 0.99 |
| 32 MB | 14727.66 | 14711.74 | 0.11 | 1.00 |
Results on Bridges (MPI) (up to 1.08x)

| Size | No CMA one-way time (us) | CMA one-way time (us) | % Improvement | Speedup |
|---|---|---|---|---|
| 1 KB | 3.96 | 5.25 | -32.63 | 0.75 |
| 2 KB | 4.18 | 5.48 | -31.07 | 0.76 |
| 4 KB | 4.74 | 6.23 | -31.42 | 0.76 |
| 8 KB | 5.66 | 7.22 | -27.50 | 0.78 |
| 16 KB | 8.08 | 10.84 | -34.11 | 0.75 |
| 32 KB | 12.02 | 13.77 | -14.51 | 0.87 |
| 64 KB | 23.82 | 22.15 | 7.04 | 1.08 |
| 128 KB | 42.55 | 41.31 | 2.91 | 1.03 |
| 256 KB | 77.81 | 74.63 | 4.09 | 1.04 |
| 512 KB | 145.11 | 140.83 | 2.95 | 1.03 |
| 1 MB | 277.25 | 273.82 | 1.23 | 1.01 |
| 2 MB | 547.88 | 540.21 | 1.40 | 1.01 |
| 4 MB | 1086.36 | 1078.66 | 0.71 | 1.01 |
| 8 MB | 2175.44 | 2188.64 | -0.61 | 0.99 |
| 16 MB | 4378.83 | 4421.36 | -0.97 | 0.99 |
| 32 MB | 13477.01 | 13336.61 | 1.04 | 1.01 |
Summary
- The Zero-copy EM API reduces the sender-side memory footprint and improves performance by avoiding a large memory allocation and the sender-side copy.
- The Zero-copy Direct API reduces both the sender-side and receiver-side memory footprint and improves performance to a larger extent by avoiding large memory allocations and copies on both the sender side and the receiver side.
- CMA proves to be a faster alternative for intra-host inter-process communication, sending messages while avoiding the network.

Questions?