
Understanding Parallel Programming Models
"Explore the concepts of parallel programming models such as MapReduce, MPI, Amdahl's Law, and guidelines for efficient parallelization in distributed systems and computer architectures."
Presentation Transcript
Distributed Systems CS 15-440 Programming Models Gregory Kesden Borrowed and adapted from our good friends at CMU-Doha, Qatar: Majd F. Sakr, Mohammad Hammoud and Vinay Kolar
Objectives Discussion on Programming Models: MapReduce, Message Passing Interface (MPI), Examples of parallel processing, Traditional models of parallel programming, Parallel computer architectures, Why parallelism?
Amdahl's Law We parallelize our programs in order to run them faster. How much faster will a parallel program run? Suppose that the sequential execution of a program takes T1 time units and the parallel execution on p processors takes Tp time units. Suppose that out of the entire execution of the program, a fraction s is not parallelizable while the remaining fraction 1 - s is parallelizable. Then the speedup (Amdahl's formula) is: Speedup = T1 / Tp = 1 / (s + (1 - s) / p)
Amdahl's Law: An Example Suppose that 80% of your program can be parallelized and that you use 4 processors to run your parallel version of the program. The speedup you can get according to Amdahl's formula is: Speedup = 1 / (0.2 + 0.8 / 4) = 2.5. Although you use 4 processors, you cannot get a speedup of more than 2.5 times (i.e., the parallel running time is at least 40% of the serial running time).
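For a quick check of the arithmetic, the formula can be evaluated directly; the following small C program (my own illustration, not part of the original slides) computes the speedup for this example:

#include <stdio.h>

/* Amdahl's law: speedup = 1 / (s + (1 - s) / p),
 * where s is the non-parallelizable fraction and p the processor count. */
double amdahl_speedup(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void) {
    /* Example from the slide: 80% parallelizable (s = 0.2), 4 processors. */
    printf("Speedup = %.2f\n", amdahl_speedup(0.2, 4));  /* prints 2.50 */
    return 0;
}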
Real Vs. Actual Cases Amdahl's argument is too simplified to be applied to real cases. When we run a parallel program, there are in general a communication overhead and a workload imbalance among the processes. [Figure: 1. Parallel Speed-up: An Ideal Case, where the parallelizable 80% is split evenly across Processes 1-4 and only the serial 20% cannot be parallelized; 2. Parallel Speed-up: An Actual Case, where communication overhead and load imbalance stretch the parallel portion beyond the ideal.]
Guidelines In order to efficiently benefit from parallelization, we ought to follow these guidelines:
1. Maximize the fraction of our program that can be parallelized
2. Balance the workload of parallel processes
3. Minimize the time spent for communication
Objectives Discussion on Programming Models: MapReduce, Message Passing Interface (MPI), Examples of parallel processing, Traditional models of parallel programming, Parallel computer architectures, Why parallelism?
Parallel Computer Architectures We can categorize the architecture of parallel computers in terms of two aspects: whether the memory is physically centralized or distributed, and whether or not the address space is shared. The combinations are:
- Shared address space, centralized memory: SMP (Symmetric Multiprocessor), also known as UMA (Uniform Memory Access)
- Shared address space, distributed memory: NUMA (Non-Uniform Memory Access)
- Individual address space, centralized memory: N/A
- Individual address space, distributed memory: MPP (Massively Parallel Processors)
Symmetric Multiprocessor Symmetric Multiprocessor (SMP) architecture uses shared system resources that can be accessed equally from all processors. [Figure: several processors, each with its own cache, connected by a bus or crossbar switch to shared memory and I/O.] A single OS controls the SMP machine, and it schedules processes and threads on processors so that the load is balanced.
Massively Parallel Processors Massively Parallel Processors (MPP) architecture consists of nodes, each having its own processor, memory, and I/O subsystem. [Figure: an interconnection network linking nodes, each containing a processor, cache, bus, memory, and I/O.] An independent OS runs at each node.
Non-Uniform Memory Access Non-Uniform Memory Access (NUMA) architecture machines are built on a similar hardware model as MPP. NUMA typically provides a shared address space to applications using a hardware/software directory-based coherence protocol. The memory latency varies according to whether you access memory directly (local) or through the interconnect (remote); thus the name non-uniform memory access. As in an SMP machine, a single OS controls the whole system.
Objectives Discussion on Programming Models: MapReduce, Message Passing Interface (MPI), Examples of parallel processing, Traditional models of parallel programming, Parallel computer architectures, Why parallelizing our programs?
Models of Parallel Programming What is a parallel programming model? A programming model is an abstraction provided by the hardware to programmers. It determines how easily programmers can specify their algorithms as parallel units of computation (i.e., tasks) that the hardware understands. It determines how efficiently parallel tasks can be executed on the hardware. Main goal: utilize all the processors of the underlying architecture (e.g., SMP, MPP, NUMA) and minimize the elapsed time of your program.
Traditional Parallel Programming Models [Diagram: Parallel Programming Models branch into Shared Memory and Message Passing.]
Shared Memory Model In the shared memory programming model, the abstraction is that parallel tasks can access any location of the memory. Parallel tasks can communicate through reading and writing common memory locations. This is similar to threads from a single process, which share a single address space. Multi-threaded programs (e.g., OpenMP programs) are the best fit with the shared memory programming model.
Shared Memory Model [Figure: single-thread vs. multi-thread execution over time, with Si = serial segments and Pj = parallel segments. In the single-thread case, S1, P1-P4, and S2 run one after another in one process; in the multi-threaded case, the process spawns threads that run P1-P4 concurrently in a shared address space, joins them, and then runs S2.]
Shared Memory Example
Sequential:
for (i=0; i<8; i++)
  a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++)
  if (a[i] > 0)
    sum = sum + a[i];
Print sum;

Parallel:
begin parallel  // spawn a child thread
  private int start_iter, end_iter, i;
  shared int local_iter=4;
  shared double sum=0.0, a[], b[], c[];
  shared lock_type mylock;

  start_iter = getid() * local_iter;
  end_iter = start_iter + local_iter;
  for (i=start_iter; i<end_iter; i++)
    a[i] = b[i] + c[i];
  barrier;

  for (i=start_iter; i<end_iter; i++)
    if (a[i] > 0) {
      lock(mylock);
      sum = sum + a[i];
      unlock(mylock);
    }
  barrier;  // necessary
end parallel  // kill the child thread
Print sum;
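As the previous slide notes, OpenMP programs are a natural fit for this model. The following is a minimal OpenMP sketch of the same computation (my own, not from the slides; the array contents are placeholder values):

#include <stdio.h>
#include <omp.h>

#define N 8

int main(void) {
    double a[N], b[N], c[N], sum = 0.0;
    int i;

    /* placeholder initialization */
    for (i = 0; i < N; i++) { b[i] = i; c[i] = 1.0; }

    /* each thread works on a chunk of the shared arrays */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    /* the reduction clause replaces the explicit lock around sum */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        if (a[i] > 0)
            sum = sum + a[i];

    printf("sum = %f\n", sum);
    return 0;
}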
Traditional Parallel Programming Models [Diagram: Parallel Programming Models branch into Shared Memory and Message Passing; next, Message Passing.]
Message Passing Model In message passing, parallel tasks have their own local memories. One task cannot access another task's memory. Hence, to communicate data, they have to rely on explicit messages sent to each other. This is similar to the abstraction of processes, which do not share an address space. MPI programs are the best fit with the message passing programming model.
Message Passing Model [Figure: single-thread execution (S = serial, P = parallel: S1, P1-P4, S2 in one process) vs. message passing execution, where Processes 0-3 on Nodes 1-4 each run S1, one parallel segment, and S2, with data transmission over the network between them.]
Message Passing Example
Sequential:
for (i=0; i<8; i++)
  a[i] = b[i] + c[i];
sum = 0;
for (i=0; i<8; i++)
  if (a[i] > 0)
    sum = sum + a[i];
Print sum;

Parallel:
id = getpid();
local_iter = 4;
start_iter = id * local_iter;
end_iter = start_iter + local_iter;

if (id == 0)
  send_msg (P1, b[4..7], c[4..7]);
else
  recv_msg (P0, b[4..7], c[4..7]);

for (i=start_iter; i<end_iter; i++)
  a[i] = b[i] + c[i];

local_sum = 0;
for (i=start_iter; i<end_iter; i++)
  if (a[i] > 0)
    local_sum = local_sum + a[i];

if (id == 0) {
  recv_msg (P1, &local_sum1);
  sum = local_sum + local_sum1;
  Print sum;
}
else
  send_msg (P0, local_sum);
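For concreteness, here is a rough MPI rendering of the same two-process computation (my own sketch, not from the slides). It assumes exactly two ranks and that rank 0 initially holds all of b and c:

#include <stdio.h>
#include <mpi.h>

#define N 8

int main(int argc, char **argv) {
    double a[N], b[N], c[N], local_sum = 0.0, remote_sum, sum;
    int i, rank, half = N / 2;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (i = 0; i < N; i++) { b[i] = i; c[i] = 1.0; }  /* placeholder data */
        /* ship the second half of b and c to rank 1 */
        MPI_Send(&b[half], half, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        MPI_Send(&c[half], half, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&b[half], half, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&c[half], half, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* each rank works on its own half */
    int start = rank * half, end = start + half;
    for (i = start; i < end; i++)
        a[i] = b[i] + c[i];
    for (i = start; i < end; i++)
        if (a[i] > 0)
            local_sum += a[i];

    if (rank == 0) {
        MPI_Recv(&remote_sum, 1, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        sum = local_sum + remote_sum;
        printf("sum = %f\n", sum);
    } else {
        MPI_Send(&local_sum, 1, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}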
Shared Memory Vs. Message Passing Comparison between the shared memory and message passing programming models:
Aspect              Shared Memory                  Message Passing
Communication       Implicit (via loads/stores)    Explicit messages
Synchronization     Explicit                       Implicit (via messages)
Hardware support    Typically required             None
Development effort  Lower                          Higher
Tuning effort       Higher                         Lower
Objectives Discussion on Programming Models: MapReduce, Message Passing Interface (MPI), Examples of parallel processing, Traditional models of parallel programming, Parallel computer architectures, Why parallelizing our programs?
SPMD and MPMD When we run multiple processes with message passing, there are further categorizations regarding how many different programs are cooperating in the parallel execution. We distinguish between two models: 1. Single Program Multiple Data (SPMD) model 2. Multiple Programs Multiple Data (MPMD) model
SPMD In the SPMD model, there is only one program and each process uses the same executable, working on different sets of data. [Figure: the same executable a.out running on Node 1, Node 2, and Node 3.]
MPMD The MPMD model uses different programs for different processes, but the processes collaborate to solve the same problem. MPMD has two styles: master/worker and coupled analysis. [Figure: example with a.out = structural analysis, b.out = fluid analysis, c.out = thermal analysis; 1. MPMD: Master/Slave and 2. MPMD: Coupled Analysis, with the executables distributed across Nodes 1-3.]
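In practice, the master/worker style is often approximated within a single MPI executable by branching on rank, with rank 0 acting as master. The sketch below is only illustrative (the do_work function and task values are hypothetical):

#include <stdio.h>
#include <mpi.h>

/* hypothetical worker computation */
static double do_work(int task) { return task * 2.0; }

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* master: hand one task to each worker, then collect results */
        double result, total = 0.0;
        for (int w = 1; w < size; w++)
            MPI_Send(&w, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
        for (int w = 1; w < size; w++) {
            MPI_Recv(&result, 1, MPI_DOUBLE, w, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            total += result;
        }
        printf("total = %f\n", total);
    } else {
        /* worker: receive a task, compute, send the result back */
        int task;
        double result;
        MPI_Recv(&task, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        result = do_work(task);
        MPI_Send(&result, 1, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}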
3 Key Points To summarize, keep the following 3 points in mind:
1. The purpose of parallelization is to reduce the time spent for computation.
2. Ideally, the parallel program is p times faster than the sequential program, where p is the number of processes involved in the parallel execution, but this is not always achievable.
3. Message passing is the tool to consolidate what parallelization has separated. It should not be regarded as the parallelization itself.
Objectives Discussion on Programming Models: MapReduce, Message Passing Interface (MPI), Examples of parallel processing, Traditional models of parallel programming, Parallel computer architectures, Why parallelizing our programs?
Message Passing Interface In this part, the following concepts of MPI will be described: Basics, Point-to-point communication, Collective communication
What is MPI? The Message Passing Interface (MPI) is a message passing library standard for writing message passing programs. The goal of MPI is to establish a portable, efficient, and flexible standard for message passing. By itself, MPI is NOT a library, but rather the specification of what such a library should be. MPI is not an IEEE or ISO standard, but it has, in fact, become the industry standard for writing message passing programs on HPC platforms.
Reasons for using MPI
Standardization: MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms.
Portability: There is no need to modify your source code when you port your application to a different platform that supports the MPI standard.
Performance opportunities: Vendor implementations should be able to exploit native hardware features to optimize performance.
Functionality: Over 115 routines are defined.
Availability: A variety of implementations are available, both vendor and public domain.
Programming Model MPI is an example of a message passing programming model. MPI is now used on just about any common parallel architecture, including MPP, SMP clusters, workstation clusters, and heterogeneous networks. With MPI, the programmer is responsible for correctly identifying parallelism and implementing parallel algorithms using MPI constructs.
Communicators and Groups MPI uses objects called communicators and groups to define which collection of processes may communicate with each other to solve a certain problem. Most MPI routines require you to specify a communicator as an argument. The communicator MPI_COMM_WORLD is often used in calling communication subroutines. MPI_COMM_WORLD is the predefined communicator that includes all of your MPI processes.
Ranks Within a communicator, every process has its own unique integer identifier, referred to as its rank, assigned by the system when the process initializes. A rank is sometimes called a task ID. Ranks are contiguous and begin at zero. Ranks are used by the programmer to specify the source and destination of messages. Ranks are often also used conditionally by the application to control program execution (e.g., if rank=0 do this / if rank=1 do that).
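A minimal sketch (not from the slides) showing how a program queries its rank within MPI_COMM_WORLD and branches on it:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank, starting at 0 */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    if (rank == 0)
        printf("I am the root of %d processes\n", size);
    else
        printf("I am rank %d\n", rank);

    MPI_Finalize();
    return 0;
}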
Multiple Communicators It is possible that a problem consists of several sub-problems where each can be solved concurrently. This type of application is typically found in the category of MPMD coupled analysis. We can create a new communicator for each sub-problem as a subset of an existing communicator. MPI allows you to achieve that by using MPI_COMM_SPLIT.
Example of Multiple Communicators Consider a problem with a fluid dynamics part and a structural analysis part, where each part can be computed in parallel. [Figure: MPI_COMM_WORLD contains ranks 0-7; Comm_Fluid groups world ranks 0-3 and Comm_Struct groups world ranks 4-7, with each sub-communicator renumbering its members 0-3.]
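One plausible way to create such sub-communicators is MPI_Comm_split; the communicator names and the split of 8 world ranks into two halves below are my assumptions based on the figure:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int world_rank, world_size, sub_rank;
    MPI_Comm sub_comm;  /* plays the role of Comm_Fluid or Comm_Struct */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    /* color 0 = fluid part (lower half of ranks), color 1 = structural part (upper half) */
    int color = (world_rank < world_size / 2) ? 0 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    MPI_Comm_rank(sub_comm, &sub_rank);
    printf("world rank %d -> %s rank %d\n",
           world_rank, color == 0 ? "Comm_Fluid" : "Comm_Struct", sub_rank);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}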
Next Class Programming Models - Part II. Outline: Discussion on Programming Models, MapReduce, Message Passing Interface (MPI), Examples of parallel processing, Traditional models of parallel programming, Parallel computer architectures, Why parallelizing our programs?
Message Passing Interface In this part, the following concepts of MPI will be described: Basics, Point-to-point communication, Collective communication
Point-to-Point Communication MPI point-to-point operations typically involve message passing between two, and only two, different MPI tasks. One task performs a send operation and the other performs a matching receive operation. [Figure: Processor 1 executes send A and Processor 2 executes recv A across the network.] Ideally, every send operation would be perfectly synchronized with its matching receive. This is rarely the case. Somehow or other, the MPI implementation must be able to deal with storing data when the two tasks are out of sync.
Two Cases Consider the following two cases:
1. A send operation occurs 5 seconds before the receive is ready - where is the message stored while the receive is pending?
2. Multiple sends arrive at the same receiving task, which can only accept one send at a time - what happens to the messages that are "backing up"?
Steps Involved in Point-to-Point Communication
Sender (Process 0):
1. The data is copied to the user buffer by the user.
2. The user calls one of the MPI send routines.
3. The system copies the data from the user buffer (sendbuf) to the system buffer (sysbuf).
4. The system sends the data from the system buffer to the destination process; sendbuf can now be reused.
Receiver (Process 1):
1. The user calls one of the MPI receive routines.
2. The system receives the data from the source process and copies it to the system buffer (sysbuf).
3. The system copies the data from the system buffer to the user buffer (recvbuf); recvbuf now contains valid data.
4. The user uses the data in the user buffer.
Blocking Send and Receive When we use point-to-point communication routines, we usually distinguish between blocking and non-blocking communication. A blocking send routine will only return after it is safe to modify the application buffer (sendbuf) for reuse. Safe means that modifications will not affect the data intended for the receive task. This does not imply that the data was actually received by the receiver; it may be sitting in the system buffer at the sender side. [Figure: Rank 0's sendbuf is transferred over the network into Rank 1's recvbuf.]
Blocking Send and Receive A blocking send can be:
1. Synchronous: means there is a handshake with the receive task to confirm a safe send
2. Asynchronous: means the system buffer at the sender side is used to hold the data for eventual delivery to the receiver
A blocking receive only returns after the data has arrived (i.e., is stored in the application recvbuf) and is ready for use by the program.
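To make the distinction concrete, here is a small sketch of my own (not from the slides). MPI_Ssend is the explicitly synchronous blocking send, while plain MPI_Send is allowed to return as soon as the data is safely buffered:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* blocking send: returns once 'value' is safe to reuse
         * (possibly only buffered, not yet received) */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        /* synchronous blocking send: returns only after the
         * matching receive has started */
        MPI_Ssend(&value, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocking receives: return only after the data has arrived in 'value' */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&value, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}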
Non-Blocking Send and Receive (1) Non-blocking send and non-blocking receive routines behave similarly: they return almost immediately. They do not wait for any communication events to complete, such as message copying from the user buffer to the system buffer, or the actual arrival of a message.
Non-Blocking Send and Receive (2) However, it is unsafe to modify the application buffer until you make sure that the requested non-blocking operation was actually performed by the library. If you use the application buffer before the copy completes, incorrect data may be copied to the system buffer (in the case of a non-blocking send), or your receive buffer does not yet contain what you want (in the case of a non-blocking receive). You can make sure of the completion of the copy by using MPI_Wait() after the send or receive operations.
Why Non-Blocking Communication? Why do we use non-blocking communication despite its complexity? Non-blocking communication is generally faster than its corresponding blocking communication: we can overlap computations while the system is copying data back and forth between application and system buffers.
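A minimal sketch (my own, assuming exactly two ranks) of overlapping computation with a non-blocking exchange, completed with MPI_Wait:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, outgoing, incoming, other;
    MPI_Request send_req, recv_req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;            /* assumes exactly two ranks */
    outgoing = rank * 100;

    /* start the exchange, then keep computing while it is in flight */
    MPI_Irecv(&incoming, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &recv_req);
    MPI_Isend(&outgoing, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &send_req);

    /* ... useful computation that does not touch 'outgoing' or 'incoming' ... */

    /* only now is it safe to reuse the buffers */
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);
    MPI_Wait(&recv_req, MPI_STATUS_IGNORE);

    printf("rank %d got %d\n", rank, incoming);
    MPI_Finalize();
    return 0;
}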
MPI Point-To-Point Communication Routines
Blocking send:        int MPI_Send( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm )
Non-blocking send:    int MPI_Isend( void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request )
Blocking receive:     int MPI_Recv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status )
Non-blocking receive: int MPI_Irecv( void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request )
Message Order MPI guarantees that messages will not overtake each other. If a sender sends two messages M1 and M2 in succession to the same destination, and both match the same receive, the receive operation will receive M1 before M2. If a receiver posts two receives R1 and R2 in succession, and both are looking for the same message, R1 will receive the message before R2.
Fairness MPI does not guarantee fairness; it is up to the programmer to prevent operation starvation. For instance, if task 0 and task 1 send competing messages (i.e., messages that match the same receive) to task 2, only one of the sends will complete. [Figure: Task 0 and Task 1 both send Msg A to Task 2; which one is matched is not determined.]
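The competing-sends scenario can be written with MPI_ANY_SOURCE; in this sketch of mine (not from the slides, and assuming at least three ranks), task 2 posts one receive that either sender may match, and the other message is only matched by a later receive:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, msg;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0 || rank == 1) {
        msg = rank;
        /* both task 0 and task 1 send a competing message to task 2 */
        MPI_Send(&msg, 1, MPI_INT, 2, 0, MPI_COMM_WORLD);
    } else if (rank == 2) {
        /* one receive matches whichever send the implementation picks */
        MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
        printf("task 2 matched the send from task %d\n", status.MPI_SOURCE);
        /* the other message remains pending until another receive is posted */
        MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}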
Unidirectional Communication When you send a message from process 0 to process 1, there are four combinations of MPI subroutines to choose from:
1. Blocking send and blocking receive
2. Non-blocking send and blocking receive
3. Blocking send and non-blocking receive
4. Non-blocking send and non-blocking receive
[Figure: Rank 0's sendbuf is transferred over the network into Rank 1's recvbuf.]