Stencil Framework for Portable High-Performance Computing


This presentation gives an overview of the Physis stencil framework for portable high-performance computing: multi-GPU application development, the programming approach, the DSL, how stencils are written and applied, and related topics. The goal is high-level abstractions for structured parallel programming that simplify the programming model, remain portable across platforms, and do not sacrifice too much performance.

  • High-Performance Computing
  • Stencil Framework
  • Multi-GPU Development
  • Parallel Programming
  • Portable Computing


Presentation Transcript


  1. Stencil Framework for Portable High Performance Computing. Naoya Maruyama, RIKEN Advanced Institute for Computational Science. April 2013 @ NCSA, IL, USA

  2. Talk Outline: Physis stencil framework; MapReduce for K; mini-app development

  3. Multi-GPU Application Development
  • Non-unified programming models: MPI for inter-node parallelism; CUDA/OpenCL/OpenACC for accelerators
  • Optimizations: blocking, overlapped computation and communication
  • Good performance, but low programmer productivity

  4. Goal and Approach
  • Goal: high-performance, highly productive programming for heterogeneous clusters
  • Approach: high-level abstractions for structured parallel programming; simplifying programming models; portability across platforms; does not sacrifice too much performance

  5. Physis Framework [SC11]
  • Stencil DSL: declarative, portable, global-view, C-based

    void diffusion(int x, int y, int z,
                   PSGrid3DFloat g1, PSGrid3DFloat g2) {
      float v = PSGridGet(g1,x,y,z)
          + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
          + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
          + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
      PSGridEmit(g2, v/7.0);
    }

  • DSL compiler: target-specific code generation and optimizations; automatic parallelization
  • Generated backends: C, C+MPI, CUDA, CUDA+MPI, OpenMP, OpenCL

  6. DSL Overview
  • C + custom data types and intrinsics
  • Grid data types: PSGrid3DFloat, PSGrid3DDouble, etc.
  • Dense Cartesian domain types: PSDomain1D, PSDomain2D, and PSDomain3D
  • Intrinsics for runtime management; grid object management (PSGrid3DFloatNew, etc.); grid accesses (PSGridCopyin, PSGridGet, etc.); applying stencils to grids (PSStencilMap, PSStencilRun); grid reductions (PSGridReduce)
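
To show how these intrinsics fit together on the host side, here is a minimal sketch (not taken from the slides). Only PSGrid3DFloatNew, PSDomain3DNew, PSGridCopyin, PSStencilMap, PSStencilRun, and PSGridReduce are named in the presentation; PSGridCopyout, the PS_SUM reduction operator, and the exact argument orders of PSGridCopyin and PSGridReduce are assumptions for illustration.

    #define NX 64
    #define NY 64
    #define NZ 64

    void run_example(void) {
      static float host_buf[NX * NY * NZ];              /* initial data, row-major          */

      PSGrid3DFloat g1 = PSGrid3DFloatNew(NX, NY, NZ);  /* allocate distributed 3D grids    */
      PSGrid3DFloat g2 = PSGrid3DFloatNew(NX, NY, NZ);
      PSDomain3D dom = PSDomain3DNew(0, NX, 0, NY, 0, NZ);

      PSGridCopyin(g1, host_buf);                       /* load host data into the grid     */

      /* Apply the diffusion kernel from slide 5, ping-ponging between g1 and g2. */
      PSStencilRun(PSStencilMap(diffusion, dom, g1, g2),
                   PSStencilMap(diffusion, dom, g2, g1), 5);

      float sum = 0.0f;
      PSGridReduce(&sum, PS_SUM, g1);                   /* assumed reduction signature      */
      PSGridCopyout(g1, host_buf);                      /* assumed counterpart of Copyin    */
    }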

  7. Writing Stencils
  • Stencil kernel: a C function describing a single flow of scalar execution on one grid element
  • Executed over specified rectangular domains

    void diffusion(const int x, const int y, const int z,
                   PSGrid3DFloat g1, PSGrid3DFloat g2, float t) {
      float v = PSGridGet(g1,x,y,z)
          + PSGridGet(g1,x-1,y,z) + PSGridGet(g1,x+1,y,z)
          + PSGridGet(g1,x,y-1,z) + PSGridGet(g1,x,y+1,z)
          + PSGridGet(g1,x,y,z-1) + PSGridGet(g1,x,y,z+1);
      PSGridEmit(g2, v/7.0*t);   // issues a write to grid g2
    }

  • Offsets must be constant
  • Periodic access is possible with PSGridGetPeriodic
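
Since the slide mentions PSGridGetPeriodic but does not show it, here is a minimal sketch of a wrap-around variant of the kernel, assuming PSGridGetPeriodic takes the same (grid, x, y, z) arguments as PSGridGet.

    /* Sketch only: neighbor accesses wrap around the domain boundaries.
     * The argument order of PSGridGetPeriodic is assumed to match PSGridGet. */
    void diffusion_periodic(const int x, const int y, const int z,
                            PSGrid3DFloat g1, PSGrid3DFloat g2, float t) {
      float v = PSGridGet(g1, x, y, z)
          + PSGridGetPeriodic(g1, x-1, y, z) + PSGridGetPeriodic(g1, x+1, y, z)
          + PSGridGetPeriodic(g1, x, y-1, z) + PSGridGetPeriodic(g1, x, y+1, z)
          + PSGridGetPeriodic(g1, x, y, z-1) + PSGridGetPeriodic(g1, x, y, z+1);
      PSGridEmit(g2, v / 7.0f * t);
    }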

  8. Applying Stencils to Grids
  • Map: creates a stencil closure that encapsulates a stencil and its grids
  • Run: iteratively executes stencil closures

    PSGrid3DFloat g1 = PSGrid3DFloatNew(NX, NY, NZ);
    PSGrid3DFloat g2 = PSGrid3DFloatNew(NX, NY, NZ);
    PSDomain3D d = PSDomain3DNew(0, NX, 0, NY, 0, NZ);
    PSStencilRun(PSStencilMap(diffusion, d, g1, g2, 0.5),
                 PSStencilMap(diffusion, d, g2, g1, 0.5),
                 10);

  • Grouping by PSStencilRun: target for kernel fusion optimization

  9. Implementation
  • DSL translator: translates intrinsic calls into runtime API calls; generates GPU kernels with boundary exchanges based on static analysis; built on the ROSE compiler framework (LLNL)
  • Runtime: provides a shared-memory-like interface for multidimensional grids over distributed CPU/GPU memory

  10. CUDA Thread Blocking
  • Each thread sweeps points in the Z dimension
  • The X and Y dimensions are blocked with AxB thread blocks, where A and B are user-configurable parameters (64x4 by default)

  11. Example: 7-point Stencil GPU Code

    __device__ void kernel(const int x, const int y, const int z,
                           __PSGrid3DFloatDev *g, __PSGrid3DFloatDev *g2) {
      float v = ((((((*__PSGridGetAddrNoHaloFloat3D(g,x,y,z)
            + *__PSGridGetAddrFloat3D_0_fw(g,(x + 1),y,z))
            + *__PSGridGetAddrFloat3D_0_bw(g,(x - 1),y,z))
            + *__PSGridGetAddrFloat3D_1_fw(g,x,(y + 1),z))
            + *__PSGridGetAddrFloat3D_1_bw(g,x,(y - 1),z))
            + *__PSGridGetAddrFloat3D_2_bw(g,x,y,(z - 1)))
            + *__PSGridGetAddrFloat3D_2_fw(g,x,y,(z + 1)));
      *__PSGridEmitAddrFloat3D(g2,x,y,z) = v;
    }

    __global__ void __PSStencilRun_kernel(int offset0, int offset1, __PSDomain dom,
                                          __PSGrid3DFloatDev g, __PSGrid3DFloatDev g2) {
      int x = blockIdx.x * blockDim.x + threadIdx.x + offset0,
          y = blockIdx.y * blockDim.y + threadIdx.y + offset1;
      if (x < dom.local_min[0] || x >= dom.local_max[0] ||
          (y < dom.local_min[1] || y >= dom.local_max[1]))
        return;
      int z;
      for (z = dom.local_min[2]; z < dom.local_max[2]; ++z) {
        kernel(x, y, z, &g, &g2);
      }
    }

  12. Example: 7-point Stencil CPU Code

    static void __PSStencilRun_0(int iter, void **stencils) {
      struct dim3 block_dim(64,4,1);
      struct __PSStencil_kernel *s0 = (struct __PSStencil_kernel *)stencils[0];
      cudaFuncSetCacheConfig(__PSStencilRun_kernel, cudaFuncCachePreferL1);
      struct dim3 s0_grid_dim((int)(ceil(__PSGetLocalSize(0) / ((double)64))),
                              (int)(ceil(__PSGetLocalSize(1) / ((double)4))), 1);
      __PSDomainSetLocalSize(&s0->dom);
      s0->g = __PSGetGridByID(s0->__g_index);
      s0->g2 = __PSGetGridByID(s0->__g2_index);
      int i;
      for (i = 0; i < iter; ++i) {
        {
          int fw_width[3] = {1L, 1L, 1L};
          int bw_width[3] = {1L, 1L, 1L};
          __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
        }
        __PSStencilRun_kernel<<<s0_grid_dim,block_dim>>>(
            __PSGetLocalOffset(0), __PSGetLocalOffset(1), s0->dom,
            *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
            *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
      }
      cudaThreadSynchronize();
    }

  13. Optimization: Overlapped Computation and Communication
  1. Copy boundaries from GPU to CPU (for non-unit-stride cases)
  2. Compute interior points
  3. Exchange boundaries with neighbors
  4. Compute boundary points
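
The four steps can be pictured with plain CUDA streams. The following is a simplified, hand-written sketch of the overlap scheme under assumed names (interior_kernel, boundary_kernel, exchange_halo_with_neighbors, and the halo buffers are all hypothetical); it is not the code Physis generates, which appears on the next slide.

    #include <cuda_runtime.h>

    // Hypothetical stand-ins for the generated interior and boundary kernels.
    __global__ void interior_kernel(const float *in, float *out);
    __global__ void boundary_kernel(const float *in, float *out);
    void exchange_halo_with_neighbors(float *host_halo, size_t bytes);  // e.g., MPI inside

    void run_overlapped(const float *d_in, float *d_out,
                        float *d_halo, float *h_halo, size_t halo_bytes,
                        dim3 grid_interior, dim3 grid_boundary, dim3 block,
                        int num_iters) {
      cudaStream_t s_interior, s_boundary;
      cudaStreamCreate(&s_interior);
      cudaStreamCreate(&s_boundary);
      for (int it = 0; it < num_iters; ++it) {
        // 1. Copy boundary planes from GPU to CPU (packed for non-unit-stride cases)
        cudaMemcpyAsync(h_halo, d_halo, halo_bytes, cudaMemcpyDeviceToHost, s_boundary);
        // 2. Compute interior points, which need no remote data, on a separate stream
        interior_kernel<<<grid_interior, block, 0, s_interior>>>(d_in, d_out);
        // 3. Exchange boundaries with neighbor ranks, then copy them back to the GPU
        cudaStreamSynchronize(s_boundary);
        exchange_halo_with_neighbors(h_halo, halo_bytes);
        cudaMemcpyAsync(d_halo, h_halo, halo_bytes, cudaMemcpyHostToDevice, s_boundary);
        // 4. Compute the boundary planes once their halo data has arrived
        boundary_kernel<<<grid_boundary, block, 0, s_boundary>>>(d_in, d_out);
        cudaDeviceSynchronize();  // join both streams before the next iteration
      }
      cudaStreamDestroy(s_interior);
      cudaStreamDestroy(s_boundary);
    }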

  14. Optimization Example: 7-Point Stencil CPU Code

    for (i = 0; i < iter; ++i) {
      // Computing interior points (on a dedicated stream)
      __PSStencilRun_kernel_interior<<<s0_grid_dim,block_dim,0,stream_interior>>>(
          __PSGetLocalOffset(0), __PSGetLocalOffset(1), __PSDomainShrink(&s0->dom,1),
          *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
          *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
      // Boundary exchange
      int fw_width[3] = {1L, 1L, 1L};
      int bw_width[3] = {1L, 1L, 1L};
      __PSLoadNeighbor(s0->g, fw_width, bw_width, 0, i > 0, 1);
      // Computing boundary planes concurrently
      __PSStencilRun_kernel_boundary_1_bw<<<1,(dim3(1,128,4)),0,stream_boundary_kernel[0]>>>(
          __PSDomainGetBoundary(&s0->dom,0,0,1,5,0),
          *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
          *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
      __PSStencilRun_kernel_boundary_1_bw<<<1,(dim3(1,128,4)),0,stream_boundary_kernel[1]>>>(
          __PSDomainGetBoundary(&s0->dom,0,0,1,5,1),
          *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
          *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
      // ... remaining boundary-plane launches elided on the slide ...
      __PSStencilRun_kernel_boundary_2_fw<<<1,(dim3(128,1,4)),0,stream_boundary_kernel[11]>>>(
          __PSDomainGetBoundary(&s0->dom,1,1,1,1,0),
          *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g))),
          *((__PSGrid3DFloatDev *)(__PSGridGetDev(s0->g2))));
      cudaThreadSynchronize();
    }
    cudaThreadSynchronize();

  15. Local Optimization: Register Blocking
  • Reuse loaded grid elements with registers

    // Original
    for (int k = 1; k < n-1; ++k) {
      g[i][j][k] = a*(f[i][j][k] + f[i][j][k-1] + f[i][j][k+1]);
    }

    // Optimized
    double kc = f[i][j][0];
    double kn = f[i][j][1];
    for (int k = 1; k < n-1; ++k) {
      double kp = kc;
      kc = kn;
      kn = f[i][j][k+1];
      g[i][j][k] = a*(kc + kp + kn);
    }

  16. Local Optimization: Common Subexpression Elimination in Offset Computation
  • Eliminates intra- and inter-iteration common subexpressions

    // Original
    for (int k = 1; k < n-1; ++k) {
      g[i][j][k] = a*(f[i+j*n+k*n*n] + f[i+j*n+(k-1)*n*n] + f[i+j*n+(k+1)*n*n]);
    }

    // Optimized
    int idx = i + j*n + n*n;
    for (int k = 1; k < n-1; ++k) {
      g[i][j][k] = a*(f[idx] + f[idx-n*n] + f[idx+n*n]);
      idx += n*n;
    }

  17. Evaluation
  • Performance and productivity
  • Sample codes: 7-point diffusion kernel (number of stencils: 1); Jacobi kernel from the Himeno benchmark (number of stencils: 1); seismic simulation (number of stencils: 15)
  • Platform: Tsubame 2.0; node: Westmere-EP 2.9GHz x 2 + M2050 x 3; dual InfiniBand QDR with a full-bisection-bandwidth fat tree

  18. Productivity: Lines of Code. [Bar chart comparing lines of code of the Original MPI, Physis, Generated (No Opt), and Generated (Opt) versions of the Diffusion, Himeno, and Seismic codes.] The Physis versions are similar in size to sequential C code.

  19. Optimization Effect: 7-point diffusion stencil on 1 GPU (Tesla M2050). [Bar chart of bandwidth (GB/s), y-axis up to 90, for the Hand-tuned, No-opt, Register Blocking, Offset CSE, and Full Opt variants.]

  20. Diffusion Weak Scaling. [Plot of GFlops vs. number of GPUs (up to ~300) for the 512x256x256 and 256x128x128 problem sizes.]

  21. Seismic Weak Scaling. Problem size: 256x256x256 per GPU. [Plot of GFLOPS vs. number of GPUs (2 GPUs per node), up to ~600 GPUs.]

  22. Diffusion Strong Scaling. Problem size: 512x512x4096. [Plot of GFlops vs. number of GPUs (up to ~150) for the 1-D, 2-D, and 3-D cases.]

  23. Himeno Strong Scaling. Problem size: XL (1024x1024x512). [Plot of Gflops vs. number of GPUs (up to ~150) for the 1-D, 2-D, and 3-D cases.]

  24. Ongoing Work
  • Auto-tuning: preliminary AT for the CUDA backend is available
  • Supporting different accelerators
  • Supporting more complex problems: stencils with limited data dependency; hierarchically organized problems (work unit: dense structured grids; overall problem domain: sparsely connected work units)
  • Examples: NICAM, an icosahedral model for climate simulation; UPACS, fluid simulation for engineering problems

  25. Further Information
  • Code is available at http://github.com/naoyam/physis
  • Maruyama et al., Physis: Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers, SC'11, 2011.

  26. Talk Outline: Physis stencil framework; MapReduce for K; mini-app development

  27. MapReduce for K
  • In-memory, MPI-based MapReduce for the K computer, implemented as a C library
  • Provides (some of the) Hadoop-like programming interfaces
  • Strong focus on scalable processing of large data sets on K; standard MPI clusters are also supported
  • Application examples: metagenome sequence analysis; replica-exchange MD, which runs hundreds of NAMD instances as Map tasks
  • Fast data loading via the Tofu network
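
The slide does not show the library's interface, so the fragment below is purely illustrative: mr_create, mr_run, and mr_emit are hypothetical stand-ins meant only to convey what a Hadoop-like map/emit/reduce flow on top of MPI might look like in C; they are not the actual API of MapReduce for K.

    #include <mpi.h>
    #include <stdio.h>

    typedef struct mr_context mr_context;                 /* hypothetical job handle       */
    typedef void (*mr_map_fn)(mr_context *, const char *record);
    typedef void (*mr_reduce_fn)(mr_context *, const char *key,
                                 const char **values, int nvalues);

    /* Hypothetical library entry points (declarations only, for illustration). */
    mr_context *mr_create(MPI_Comm comm);
    void mr_emit(mr_context *mr, const char *key, const char *value);
    void mr_run(mr_context *mr, const char *input_path,
                mr_map_fn map, mr_reduce_fn reduce);

    /* Map: emit one key-value pair per input record. */
    static void map_record(mr_context *mr, const char *record) {
      mr_emit(mr, record, "1");
    }

    /* Reduce: aggregate all values that share a key (here, simple counting). */
    static void reduce_key(mr_context *mr, const char *key,
                           const char **values, int nvalues) {
      (void)values;
      char count[32];
      snprintf(count, sizeof count, "%d", nvalues);
      mr_emit(mr, key, count);
    }

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      mr_context *mr = mr_create(MPI_COMM_WORLD);   /* one MapReduce job per communicator */
      mr_run(mr, "input/", map_record, reduce_key); /* map, shuffle, reduce across ranks  */
      MPI_Finalize();
      return 0;
    }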

  28. Talk Outline: Physis stencil framework; MapReduce for K; mini-app development

  29. HPC Mini Applications
  • A set of mini-apps derived from national key applications
  • Source code will be publicly released around Q1-Q2 2014
  • Expected final number of apps: fewer than 20
  • Part of the national pre-exascale projects, supported by the MEXT HPCI Feasibility Study program (PI: Hirofumi Tomita, RIKEN AICS)

  30. Mini-App Methodology
  • Step 1: request application submissions from current users of Japanese HPC systems
  • Documentation: mathematical background; target capability and capacity; input data sets; validation methods
  • Application code: full-scale applications, capable of running complete end-to-end simulations (17 submissions so far); stripped-down applications, simplified to only the essential parts of the code (6 submissions so far)

  31. Mini-App Methodology
  • Step 2: derive mini applications from the submitted applications
  • Understanding performance-critical patterns: computation, memory access, file I/O, network I/O
  • Reducing codebase size: removing code not used for the target problems; applying standard software engineering practices (e.g., DRY)
  • Refactoring into reference implementations
  • Performance modeling, in collaboration with the ASPEN project (Jeff Vetter at ORNL)
  • (Optional) Versions optimized for specific architectures

  32. Mini-App Example: Molecular Dynamics
  • Kernels: pairwise force calculation + long-range updates
  • Two alternative algorithms for solving the equivalent problem: FFT-based Particle Mesh Ewald, bottlenecked by all-to-all communication at scale; and the Fast Multipole Method, a tree-based formulation with no all-to-all communication
  • Simplified problem setting: only simulates water molecules in the NVE setting, which reduces the codebase significantly and makes it easier to create input data sets of different scales
  • Two reference implementations to study the performance implications of the algorithmic differences: MARBLE (20K SLOC) and MODYLAS (16K SLOC)
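
To make "pairwise force calculation" concrete, here is a generic short-range force loop in C. It is a textbook-style illustration, not code from MARBLE or MODYLAS; the Lennard-Jones parameters (eps, sigma, rcut) and the all-pairs loop are simplifying assumptions (production codes use cell or neighbor lists).

    /* Generic illustration of the pairwise force kernel that dominates MD
     * mini-apps; not taken from MARBLE or MODYLAS. */
    typedef struct { double x, y, z; } vec3;

    void pairwise_forces(int n, const vec3 *pos, vec3 *force,
                         double eps, double sigma, double rcut) {
      double rc2 = rcut * rcut;
      for (int i = 0; i < n; ++i) force[i].x = force[i].y = force[i].z = 0.0;
      for (int i = 0; i < n; ++i) {
        for (int j = i + 1; j < n; ++j) {         /* all pairs; real codes use cell lists */
          double dx = pos[i].x - pos[j].x;
          double dy = pos[i].y - pos[j].y;
          double dz = pos[i].z - pos[j].z;
          double r2 = dx*dx + dy*dy + dz*dz;
          if (r2 > rc2) continue;                 /* short-range cutoff */
          double s2 = (sigma * sigma) / r2;
          double s6 = s2 * s2 * s2;
          /* Lennard-Jones force magnitude expressed as a factor on the displacement */
          double f = 24.0 * eps * s6 * (2.0 * s6 - 1.0) / r2;
          force[i].x += f * dx; force[i].y += f * dy; force[i].z += f * dz;
          force[j].x -= f * dx; force[j].y -= f * dy; force[j].z -= f * dz;
        }
      }
    }

The long-range part is what distinguishes the two algorithms on the slide: Particle Mesh Ewald handles it with FFTs, which leads to the all-to-all communication bottleneck at scale, while the Fast Multipole Method replaces it with a tree traversal.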
