High Performance Fortran: Understanding Concurrent Execution Features
This presentation covers the history and key directives of High Performance Fortran (HPF): how HPF expresses data parallelism on top of Fortran 90 through the PROCESSORS, DISTRIBUTE, and ALIGN directives and related features; the motivation behind HPF's development and its current status; and examples such as Jacobi relaxation. It also discusses how HPF simplifies parallel programming and supports performance optimization on distributed-memory machines, before turning to Co-Array Fortran and other PGAS systems.
Presentation Transcript
High Performance Fortran Slides based on John Merlin's slides on HPF
Motivation OpenMP: performance issues are not visible to the programmer. Jacobi relaxation: the ghost layers are brought into a processor's cache through communication. What are the solutions for a distributed-memory machine?
HPF History Built on CM Fortran and Fortran 90. It became extinct because it was not powerful enough and the ideas and compilers were not mature; the Earth Simulator later revived interest in it. It is being worked on again at Rice and in Japan.
PROCESSORS directive Declares processor arrangements: !HPF$ PROCESSORS p(4), q(NUMBER_OF_PROCESSORS()/2, 2) and !HPF$ PROCESSORS s. PROCESSORS declares abstract processors; they may actually be processes, and there can be more than one process on a physical processor.
DISTRIBUTE directive Distributes an array over a processor arrangement.
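As an illustration (the array name A is an assumption, not from the slides), a block distribution over the arrangement p declared above might look like:
REAL A(100, 100)
!HPF$ PROCESSORS p(4)
!HPF$ DISTRIBUTE A(BLOCK, *) ONTO p   ! rows split into 4 contiguous blocks; columns stay whole on each processor
Other patterns such as CYCLIC and BLOCK(k) spread elements round-robin or in fixed-size blocks.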
ALIGN directive Relates the elements of an array to the elements of another array so that aligned elements are placed on the same processor.
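A minimal sketch, with assumed array names A and B: aligning A with B makes each A(i) live wherever B(i) lives, so an expression such as A(i) + B(i) needs no communication.
REAL A(100), B(100)
!HPF$ DISTRIBUTE B(BLOCK)
!HPF$ ALIGN A(i) WITH B(i)   ! A(i) is mapped to the same processor as B(i)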
Concurrent execution features Features that express the potential for data parallelism and concurrency: Fortran 90 array syntax, plus the concurrent execution features introduced by HPF 1.0: FORALL, PURE procedures, and the INDEPENDENT directive.
Data parallelism in Fortran 90 Fortran 90 has no explicit parallelism, but its array features have implicit data parallelism.
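For example (the array names are assumptions), whole-array operations specify elementwise computations with no prescribed ordering, which a compiler is free to execute concurrently:
REAL A(1000), B(1000), C(1000)
C = A + B                       ! elementwise add; no iteration order is implied
WHERE (A > 0.0) C = SQRT(A)     ! masked elementwise update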
Example: Jacobi relaxation
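The slide's figure and code are not reproduced in this transcript; the following is a rough sketch of how a Jacobi sweep might be written with Fortran 90 array syntax plus HPF directives (the names u, unew and the size n are assumptions):
INTEGER, PARAMETER :: n = 100
REAL u(0:n+1, 0:n+1), unew(0:n+1, 0:n+1)
!HPF$ PROCESSORS p(4)
!HPF$ DISTRIBUTE u(BLOCK, *) ONTO p
!HPF$ ALIGN unew(i, j) WITH u(i, j)
! one relaxation sweep over the interior points
unew(1:n, 1:n) = 0.25 * (u(0:n-1, 1:n) + u(2:n+1, 1:n) + u(1:n, 0:n-1) + u(1:n, 2:n+1))
u(1:n, 1:n) = unew(1:n, 1:n)
The BLOCK distribution is what creates the ghost-layer communication mentioned on the Motivation slide: each processor needs the boundary rows owned by its neighbours.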
FORALL statement FORALL is a data-parallel loop. E.g. FORALL (i = 2:9) A(i) = 0.5*(A(i-1) + A(i+1)) is equivalent to A(2:9) = 0.5 * (A(1:8) + A(3:10))
PURE functions and subroutines They have no side effects, so they can be executed concurrently.
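A minimal sketch (the function f and the arrays A and B are assumptions): a PURE function may be called from inside FORALL precisely because it has no side effects.
PURE REAL FUNCTION f(x)
  REAL, INTENT(IN) :: x
  f = x*x + 1.0                  ! no I/O, no modification of global state
END FUNCTION f
...
FORALL (i = 1:n) A(i) = f(B(i))  ! legal because f is PURE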
INDEPENDENT directive This asserts that the iterations of a DO loop or FORALL are data-independent and can be executed concurrently. INDEPENDENT is an assertion about the behavior of the program; if it is false, the program may fail.
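A typical illustration (ix, A and B are assumed names): the directive tells the compiler that the index vector contains no repeated values, something it cannot prove on its own.
!HPF$ INDEPENDENT
DO i = 1, n
   A(ix(i)) = B(i)     ! safe only if ix(1:n) has no duplicate entries
END DO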
HPF and MPI To support generality, we can make MPI calls through extrinsic (non-HPF) procedures: EXTRINSIC (HPF_SERIAL) to call Fortran 95 on one processor, and EXTRINSIC (HPF_LOCAL) to call Fortran 95 concurrently on all processors.
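A sketch of how an HPF_LOCAL routine might be declared and called (the names exchange and A are assumptions): inside the routine each processor sees only its local section of the distributed array and can make MPI calls directly.
INTERFACE
   EXTRINSIC (HPF_LOCAL) SUBROUTINE exchange(a)
      REAL, INTENT(INOUT) :: a(:)
!HPF$    DISTRIBUTE a(BLOCK)
   END SUBROUTINE exchange
END INTERFACE
CALL exchange(A)   ! executed concurrently on every processor with its local piece of A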
Co-Array Fortran Laxmikant V. Kale
Co-array Fortran Idea: provide the familiar Fortran 95 programming model with minimal extensions to refer to data on remote nodes and to synchronize (wait for other processors to arrive at some point in their program). Syntax extension: normal Fortran arrays use (i) to indicate an index; CAF adds [p] to indicate data on processor p. More info at www.co-array.org and http://www.hipersoft.rice.edu/caf; see especially the first 8-10 slides of the Co-array Fortran presentation at http://www.hipersoft.rice.edu/caf/publications/index.html
REAL, DIMENSION(N)[*] :: X, Y
X(:) = Y(:)[Q]
X = Y[PE]          ! get from Y[PE]
Y[PE] = X          ! put into Y[PE]
Y[:] = X           ! broadcast X
Y[LIST] = X        ! broadcast X over subset of PEs in array LIST
Z(:) = Y[:]        ! collect all Y
S = MINVAL(Y[:])   ! min (reduce) all Y
B(1:M)[1:N] = S    ! S scalar, promoted to array of shape (1:M,1:N)
One-sided Communication with Co-Arrays
integer a(10,20)[*]
[Figure: the co-array a(10,20) has one copy on each of image 1, image 2, ..., image N]
if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]   ! each image (except the first) gets the last two columns of its left neighbour
Synchronization and Misc. SYNC_ALL(): barrier; SYNC_ALL(LIST): everyone waits for their own list; SYNC_TEAM(TEAM): only the team is involved in the sync; SYNC_TEAM(TEAM, LIST): combination of the above; NUM_IMAGES(), THIS_IMAGE().
CAF Programming Model Features Synchronization intrinsic functions: sync_all, a barrier and a memory fence; sync_mem, a memory fence; sync_team([notify], [wait]), where notify is a vector of process ids to signal and wait is a vector of process ids to wait for, a subset of notify. Pointers and (perhaps asymmetric) dynamic allocation. Parallel I/O.
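A small sketch combining these features (the names edge and halo are assumptions): a one-sided get from the left neighbour, bracketed by barriers so that no image reads data that has not yet been written or overwrites data still being read.
REAL :: edge(10)[*]
REAL :: halo(10)
INTEGER :: me
me = this_image()
! ... each image fills its own edge ...
call sync_all()                        ! every image has written its edge
if (me > 1) halo(:) = edge(:)[me-1]    ! one-sided get from the left neighbour
call sync_all()                        ! all gets complete before edge is reused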
Shmem, ARMCI, and Global Arrays SHMEM (shared memory) on the Cray T3D. Basic idea: non-cache-coherent shared memory across nodes. Primitives Get and Put: copy data between the calling processor and a remote processor, synchronously. SGI's version: http://www.shmem.org/; OpenSHMEM: http://sc08.supercomputing.org/scyourway/conference/view/bof140.html
Rice Implementation See John Mellor-Crummey's presentation. History of SHMEM: Cray T3D (Alpha 21064), put and get primitives (shmem_get, ...), non-cache-coherent shared memory. That's the idea that stayed with us, in hardware and in software.
PGAS: Partitioned Global Address Space PGAS languages and libraries support a global view of data. Global Arrays library (no compiler magic): declare global arrays and partition them across nodes; get and put primitives operate on tiles (sub-sections) of a global array; NWChem is a major application developed using it. UPC and UPC++ (Unified Parallel C): take the idea of a pointer with pointer arithmetic to the distributed-memory world; Berkeley and GWU implementations; an important system still in use. CAF (Co-array Fortran) is now incorporated in the Fortran standard.
Other programming models Task-based parallel programming models: PaRSEC, Legion. Higher-level compiled languages: Chapel, Regent, ... C++-based systems: STAPL, DARMA, FleCSI.