Early Performance Evaluation of Lattice QCD on POWER+GPU Cluster

This presentation gives an early performance evaluation of lattice Quantum Chromodynamics (QCD) on a POWER+GPU cluster, covering the IBM POWER8 and Nvidia Tesla GPU platform, the development environment, performance considerations for the Wilson-Dirac operator, data structure choices, GPU offloading and parallelization, and strong- and weak-scaling results.

Presentation Transcript


  1. Early Performance Evaluation of Lattice QCD on POWER+GPU Cluster
     Jun Doi (doichan@jp.ibm.com), IBM Research Tokyo, 17 July 2015

  2. POWER Gains New Power of Computation: POWER + GPU
     • OpenPOWER Foundation: open architecture, collaborative development of systems & software
     • Nvidia's GPUs now support POWER processors
     • Next-generation supercomputers: the CORAL collaboration, more than 100 PetaFlops (SUMMIT and SIERRA)
     • NVLINK: a faster data link between hosts and GPUs

  3. IBM POWER8 and Nvidia Tesla GPU
     • The first IBM POWER product with GPUs: IBM Power System S824L
     • 2 sockets of POWER8 processors per node
     • Supports up to 2 Tesla GPUs (K40 or K80) per node, connected via PCI Express (Gen3 x16)
     • Linux and little-endian support: Ubuntu 14.04

  4. Overview of the POWER8 Processor
     • 12 cores per socket @ 3.02 GHz (S824L)
     • 8 hardware threads (SMT) per core: 96 threads per socket, 192 threads per node
     • SIMD instructions (VMX/VSX): 2x 2-way double-precision FMA instructions per cycle (2x 4-way for single precision)
     • 289.92 GFlops (double) / 579.84 GFlops (single) per socket @ 3.02 GHz
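
     The quoted peaks follow directly from the figures above, counting each FMA as 2 flops, so 2 x 2 x 2 = 8 double-precision flops per cycle per core (16 in single precision):

       12 cores x 3.02 GHz x  8 flops/cycle = 289.92 GFlop/s per socket (double)
       12 cores x 3.02 GHz x 16 flops/cycle = 579.84 GFlop/s per socket (single)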

  5. Development Environment of POWER+GPU Systems
     • Very similar environment to a usual x86+GPU Linux cluster
     • The POWER processor supports little endian, so most software runs much as it does on x86 clusters
     • Compilers: gcc (can optimize for POWER8); IBM XL compilers (strong optimizer for C/C++ and Fortran)
     • CUDA 7.0

  6. Performance Consideration of the Wilson-Dirac Operator
     • Wilson-Dirac operator (hopping term):
       D\psi(n) = \sum_{\mu=1}^{4} \left[ (1-\gamma_\mu)\, U_\mu(n)\, \psi(n+\hat\mu) + (1+\gamma_\mu)\, U_\mu^\dagger(n-\hat\mu)\, \psi(n-\hat\mu) \right]
     • Per lattice site: 1,488 flops (1,464 flops with even-odd preconditioning)
     • Memory traffic per site (double precision): 2,688 bytes loaded, 192 bytes stored; single precision is half of that
     • Required arithmetic intensity: 2.06 byte/flop double (2.03 even-odd), 1.03 byte/flop single (1.01 even-odd)
     • Machine balance (byte/flop): POWER8 0.66 double / 0.33 single; Tesla K40 0.20 double / 0.067 single
     • Memory bandwidth is the bottleneck, so reducing memory access is important for performance (especially on GPUs)
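
     The machine-balance figures above can be reproduced from the peak bandwidth and peak flops quoted on the testing-environment slide; a minimal host-side sketch (plain C, values copied from the slides rather than measured here):

       /* Machine balance = peak memory bandwidth / peak flops.
          A kernel needing more bytes per flop than this is memory-bandwidth bound. */
       #include <stdio.h>

       int main(void) {
           const double p8_bw = 192e9,  p8_dp = 289.92e9, p8_sp = 579.84e9;  /* POWER8 socket */
           const double k40_bw = 288e9, k40_dp = 1430e9,  k40_sp = 4290e9;   /* Tesla K40     */

           printf("POWER8 balance: %.2f B/flop double, %.2f B/flop single\n",
                  p8_bw / p8_dp, p8_bw / p8_sp);
           printf("K40    balance: %.2f B/flop double, %.3f B/flop single\n",
                  k40_bw / k40_dp, k40_bw / k40_sp);
           /* Wilson-Dirac needs ~2.06 B/flop in double precision, far above either balance. */
           return 0;
       }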

  7. Data Structure Comparison
     • AoS (Array of Structures): one structure per site holds the full 3x4 spinor (complex elements); better for cache optimization on the CPU, used for POWER8
     • SoA (Structure of Arrays): one array per spinor element across all sites; better for memory coalescing, used for the GPU
     [Diagrams: AoS layout Spin1_1[0], Spin1_2[0], ..., Spin4_3[0], Spin1_1[1], ... versus SoA layout Spin1_1[0..N-1], Spin1_2[0..N-1], ...]
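
     A minimal sketch of the two layouts for a field of 3x4 complex spinors over N sites; the type and member names are illustrative, not taken from the presentation's code:

       /* A Wilson spinor carries 4 spins x 3 colors = 12 complex numbers per lattice site. */
       typedef struct { double re, im; } cplx;

       /* AoS: all 12 components of one site are contiguous.
          Good cache behavior when one CPU thread processes a whole site (POWER8 version). */
       typedef struct { cplx s[4][3]; } spinor_aos;      /* spinor_aos field[N]; */

       /* SoA: one length-N array per (spin,color) component.
          Consecutive GPU threads handle consecutive sites, so their loads of the same
          component hit consecutive addresses and coalesce (GPU version). */
       typedef struct { cplx *comp[4][3]; } spinor_soa;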

  8. Offloading the Wilson-Dirac Operator to the GPU
     • All data is kept ready on the GPU; the problem size is limited to fit in GPU memory
     • Gauge matrices are transferred to the GPU in advance
     • Spinor fields used in the BiCGStab solver are allocated and transferred to the GPU in advance
     • No data transfer between host and GPU, except the boundary exchange needed for parallelization
     • Multi-GPU offloading: no support for GPU Direct (p2p copy), so GPU-to-GPU transfers go through host memory; single-thread / multi-GPU execution
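
     A minimal CUDA host-side sketch of this resident-on-GPU setup; the function, buffer names and sizes are assumptions for illustration, not the presentation's code:

       #include <cuda_runtime.h>

       /* Allocate gauge links and a solver spinor once on the device and copy them up front;
          the BiCGStab iterations then run entirely out of GPU memory, and only small
          boundary buffers cross PCIe during the solve. */
       void setup_device_fields(const double *h_gauge, const double *h_source,
                                size_t gauge_bytes, size_t spinor_bytes,
                                double **d_gauge, double **d_spinor)
       {
           cudaMalloc((void **)d_gauge, gauge_bytes);
           cudaMalloc((void **)d_spinor, spinor_bytes);
           cudaMemcpy(*d_gauge, h_gauge, gauge_bytes, cudaMemcpyHostToDevice);
           cudaMemcpy(*d_spinor, h_source, spinor_bytes, cudaMemcpyHostToDevice);
       }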

  9. Optimization of the Wilson-Dirac Operator on the GPU
     • Warp size fitting: set the number of threads per thread block to a multiple of 32 (the warp size), defined by the size of X as the least common multiple of X and 32; e.g. for X = 48, use 2*48 = 96 threads (3 warps)
     • Gauge field parameterization: load only 6 elements of the SU(3) gauge matrix and reconstruct the other 3 by calculation: 42 extra flops for 2/3 of the memory access to the gauge field
       U = \begin{pmatrix} a_0 & a_1 & a_2 \\ b_0 & b_1 & b_2 \\ c_0 & c_1 & c_2 \end{pmatrix}, \qquad \vec{c} = (\vec{a} \times \vec{b})^*
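
     A sketch of this two-row reconstruction in device code: the third row is the complex conjugate of the cross product of the first two rows. The complex type and helper names are illustrative; at 6 flops per complex multiply and 2 per conjugated subtraction this costs 3 x (2*6 + 2) = 42 flops, matching the count above.

       struct cplx { double re, im; };

       __device__ cplx cmul(cplx x, cplx y) {            /* x * y: 6 flops */
           cplx r = { x.re * y.re - x.im * y.im, x.re * y.im + x.im * y.re };
           return r;
       }
       __device__ cplx csub_conj(cplx x, cplx y) {       /* conj(x - y): 2 flops */
           cplx r = { x.re - y.re, -(x.im - y.im) };
           return r;
       }
       /* c = conj(a x b): the missing third row of the SU(3) link */
       __device__ void reconstruct_third_row(const cplx a[3], const cplx b[3], cplx c[3]) {
           c[0] = csub_conj(cmul(a[1], b[2]), cmul(a[2], b[1]));
           c[1] = csub_conj(cmul(a[2], b[0]), cmul(a[0], b[2]));
           c[2] = csub_conj(cmul(a[0], b[1]), cmul(a[1], b[0]));
       }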

  10. Parallelization of Lattice QCD on the GPU Cluster
     • X dimension: the lattice is not divided in X, to avoid non-sequential access on the inner-most boundary
     • Y, Z and T dimensions: inside a node the lattice is divided in T between the 2 GPUs; across nodes it is divided in T, and also in Z and Y
     [Diagram: four POWER8 nodes, each with GPU0 and GPU1, showing the lattice split in T between the GPUs of a node and in T/Z/Y across nodes]
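
     A small sketch of the local sub-lattice each GPU would own under this decomposition; the node-grid variables are hypothetical and only illustrate the rule (X kept whole, T split across nodes and then between the two GPUs, Z and Y split across nodes):

       /* Local sub-lattice owned by one GPU (illustrative helper, not from the presentation). */
       void local_lattice(int X, int Y, int Z, int T,     /* global lattice size      */
                          int Py, int Pz, int Pt,         /* node grid in Y, Z, T     */
                          int gpus_per_node,              /* 2 on the S824L           */
                          int local[4])
       {
           local[0] = X;                                  /* X is never divided        */
           local[1] = Y / Py;
           local[2] = Z / Pz;
           local[3] = T / (Pt * gpus_per_node);           /* T split by nodes and GPUs */
       }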

  11. Offloading and Data Transfer of the Wilson-Dirac Operator
     (1) Make half-spinors on the boundary (GPU)
     (2) GPU -> host transfer into the send buffer
     (3) Node-to-node transfer (MPI)
     (4) Inner calculation
     (5) Host -> GPU transfer from the receive buffer
     (6) Boundary calculation
     [Diagram: gauge field and spinors on the GPU, send/receive buffers on the host, output spinor on the GPU]
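
     A condensed host-side sketch of steps (1)-(6) for a single boundary direction, assuming the kernels, buffers and neighbor ranks named here exist (they are placeholders, not the presentation's code); in practice one such sequence runs per direction while the inner kernel proceeds on its own stream:

       make_boundary_halfspinor<<<grid_b, block, 0, stream>>>(d_spinor, d_sendbuf);    /* (1) */
       cudaMemcpyAsync(h_sendbuf, d_sendbuf, halo_bytes,
                       cudaMemcpyDeviceToHost, stream);                                /* (2) */
       cudaStreamSynchronize(stream);
       MPI_Sendrecv(h_sendbuf, halo_bytes, MPI_BYTE, rank_fwd, 0,
                    h_recvbuf, halo_bytes, MPI_BYTE, rank_bwd, 0,
                    MPI_COMM_WORLD, MPI_STATUS_IGNORE);                                /* (3) */
       /* (4) the inner Wilson-Dirac kernel runs meanwhile on a separate stream */
       cudaMemcpyAsync(d_recvbuf, h_recvbuf, halo_bytes,
                       cudaMemcpyHostToDevice, stream);                                /* (5) */
       boundary_dirac<<<grid_b, block, 0, stream>>>(d_gauge, d_recvbuf, d_out);        /* (6) */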

  12. Asynchronous Data Transfer Using CUDA Streams
     • (Number of parallel dimensions) x 2 + 1 streams execute asynchronously: one per boundary direction (Y-, Y+, Z-, Z+, T-, T+) plus one for the inner calculation
     [Timeline diagram: per-direction streams overlap making half-spinors, GPU-host transfers, asynchronous MPI transfers and the boundary matrix multiplies (plus and minus directions) with the inner calculation]
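
     A minimal sketch of the stream setup implied above; the names and the NDIM_PARALLEL constant are illustrative:

       #include <cuda_runtime.h>

       #define NDIM_PARALLEL 3                     /* Y, Z and T are divided; X is not      */
       #define NSTREAMS (NDIM_PARALLEL * 2 + 1)    /* minus/plus per dimension + inner part */

       static cudaStream_t streams[NSTREAMS];

       void create_streams(void)
       {
           for (int i = 0; i < NSTREAMS; ++i)
               cudaStreamCreate(&streams[i]);
           /* boundary work goes to streams[0 .. NSTREAMS-2],
              the inner Wilson-Dirac kernel to streams[NSTREAMS-1] */
       }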

  13. Testing Environment
     • IBM Power System S824L: 2 POWER8 sockets + 2 Tesla K40 GPUs per node; 4 nodes, connected via InfiniBand
     • CPU (IBM POWER8): 12 cores, 3.02 GHz, peak 289.92 GFlops double / 579.84 GFlops single, 512 GB memory, 192 GB/s memory bandwidth
     • GPU (Nvidia Tesla K40): 2,880 cores, 0.745 GHz, peak 1,430 GFlops double / 4,290 GFlops single, 12 GB memory, 288 GB/s memory bandwidth
     • Software: Ubuntu 14.04.1, CUDA Toolkit 7.0

  14. POWER8 Performance: Strong Scaling (Double)
     [Charts: sustained performance (GFlop/s) vs. number of nodes (1-4) for the even-odd preconditioned Wilson-Dirac operator and for BiCGStab; lattice sizes 24x24x24x48, 32x32x32x64 and 48x48x48x96]

  15. POWER8 Performance: Weak Scaling (Double)
     [Charts: sustained performance (GFlop/s) vs. number of nodes (1, 2, 4) for the even-odd preconditioned Wilson-Dirac operator and for BiCGStab; lattice sizes 24x24x24x24n, 32x32x32x32n and 48x48x48x48n]

  16. K40 Performance: Strong Scaling (Double)
     [Charts: sustained performance (GFlop/s) vs. number of nodes (1-4) for the even-odd preconditioned Wilson-Dirac operator and for BiCGStab; lattice sizes 24x24x24x48, 32x32x32x64 and 48x48x48x96]

  17. K40 Performance: Weak Scaling (Double)
     [Charts: sustained performance (GFlop/s) vs. number of nodes (1, 2, 4) for the even-odd preconditioned Wilson-Dirac operator and for BiCGStab; lattice sizes 24x24x24x24n, 32x32x32x64n and 48x48x48x48n]

  18. K40 Performance: Strong Scaling (Single)
     [Charts: sustained performance (GFlop/s) vs. number of nodes (1-4) for the even-odd preconditioned Wilson-Dirac operator and for BiCGStab; lattice sizes 24x24x24x48, 32x32x32x64 and 48x48x48x96]

  19. K40 Performance: Weak Scaling (Single)
     [Charts: sustained performance (GFlop/s) vs. number of nodes (1, 2, 4) for the even-odd preconditioned Wilson-Dirac operator and for BiCGStab; lattice sizes 24x24x24x24n, 32x32x32x64n and 48x48x48x48n]

  20. Summary
     • POWER has gained new power of computation
     • POWER8: high bandwidth per flop and high efficiency; 350+ GFlop/s on 4 nodes, higher efficiency than the GPU; could be improved further by hand-coding the SIMD
     • Tesla GPU: high computational capacity and high performance; 1,000+ GFlop/s with 8 K40s in double precision, 2,000+ GFlop/s with 8 K40s in single precision
     • Future work: optimization with newer GPUs and NVLINK, workload balancing between POWER and GPUs, and other solvers and algorithms for more efficiency
