
Communication Cost Models and Overheads
Explore cost models and overhead factors in message passing for communication systems. Learn about messaging costs, processor overhead, communication basics, and more to enhance network analysis and efficiency.
A simple cost model for message passing
Messaging cost is α + n·β, where
α: per-message cost
β: per-byte cost
For the sake of intuition, assume α is about 1,000 times larger than β.
Typical values: the per-message cost may be 1 microsecond on a supercomputer and tens of microseconds on commodity clusters; the per-byte cost may be 1 ns (which corresponds to 1 GByte/s, not Gbit/s, of bandwidth).
What do we mean by messaging cost? Latency? Overhead to the processor?
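As a quick illustration (not from the slides), the model can be written as a tiny helper; message_time_sec is a hypothetical name, and the constants are the illustrative values quoted above, not measurements:

  /* Back-of-the-envelope alpha-beta cost helper (illustrative values only). */
  double message_time_sec(double n_bytes)
  {
      const double alpha = 1e-6;   /* per-message cost: ~1 microsecond */
      const double beta  = 1e-9;   /* per-byte cost: ~1 ns/byte, i.e. 1 GByte/s */
      return alpha + n_bytes * beta;
  }
  /* Example: message_time_sec(1024) is about 2e-6 s, so the per-message term
     dominates until messages reach roughly alpha/beta = 1000 bytes. */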
Communication Cost Model
This model ignores several factors:
It assumes there is no contention in the network.
It assumes latency is independent of distance, which is mostly true in the no-contention scenario.
It also ignores packetization and per-packet overheads.
We will revisit these issues when we study network topologies, but note that for first-order analysis the α + n·β model is good enough.
Communication co-processor and CPU overhead
CPU overhead: how much time is taken away from the CPU to deal with communication, including function calls, data copying (if needed), setup and interaction with the network, and, in MPI, tag matching.
Using a co-processor (as a modern NIC will do) off-loads some of the communication work from the CPU, especially explicit data copying in many cases.
For this reason, it is useful to separate overhead and latency.
Communication Basics: Point-to-point
A message passes through: sending processor, sending co-processor, network, receiving co-processor, receiving processor.
Each of these costs, for an n-byte message, has the form α + n·β: a per-message cost plus a per-byte cost.
Important metrics: overhead at the processor and co-processor, network latency, network bandwidth consumed, and number of hops traversed.
Communication Basics
Message latency: the time between the application sending the message and it being received on the other processor.
Send overhead: the time for which the sending processor is occupied with the message.
Receive overhead: the time for which the receiving processor is occupied with the message.
Network latency.
Separating out overhead in the analysis is useful only when you use the time until communication completes for some computation, or for setting up another communication.
Example: consider the cost of sending 10 messages to 10 distinct processors. The overheads are serialized on the CPU; the latencies are overlapped.
Except where we state otherwise, we will use the simple α + n·β model without separating out overhead.
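One hedged way to write the 10-message example (using o for per-message CPU overhead and L for network latency, symbols introduced only for this sketch):

  T_{10} \approx 10\,o + L + n\beta \quad\text{rather than}\quad 10\,(o + L + n\beta)

because the ten overheads are paid one after another on the CPU, while the ten network traversals proceed concurrently.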
Overall Cost Model
Execution time (aka completion time) can be modeled, for many applications, as T = Tcomp + Tcomm.
This assumes that all processes are doing the same work and that there is no overlap of communication and computation.
With overlap, T = Tcomp + Tcomm - Toverlap.
Typically, to get the completion time expression, write it for each processor (if they are all doing the same thing), for the worst-loaded processor if the loads are imbalanced, or for the longest chain (critical path) if that dominates.
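One hedged way to write the general form, covering the imbalanced case above (k indexes the processors):

  T \approx \max_{k}\left(T^{(k)}_{\mathrm{comp}} + T^{(k)}_{\mathrm{comm}} - T^{(k)}_{\mathrm{overlap}}\right)

which reduces to T = Tcomp + Tcomm when every processor does the same work and nothing overlaps.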
Other models: LogP, LogGP, etc.
LogP is an acronym: L: latency, o: overhead, g: gap, P: processors.
g: the time period for injecting a short, fixed-size message; it roughly accounts for bandwidth, contention, or injection rate.
The model is meant for theoretical analysis of algorithms with short, fixed-size messages.
LogGP generalizes it to arbitrary-size messages, and starts resembling the α + n·β model, as long as we take cognizance of overhead.
My advice: write (i.e., model) the completion-time expression, phase by phase if possible, with the α + n·β model, keeping in mind the characteristics of the algorithm (critical path? load balance? overlap?). Always measure, and compare with the model.
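For reference, the point-to-point times under these models are usually sketched as follows (G is LogGP's per-byte gap; treat the mapping as approximate):

  \text{LogP, short message: } t \approx o_{send} + L + o_{recv}
  \qquad
  \text{LogGP, } k \text{ bytes: } t \approx o + (k-1)G + L + o

which lines up with the α + n·β model if one reads α ≈ 2o + L and β ≈ G.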
Cost Model Examples: Stencil and Gauss-Seidel
Gauss-Jacobi Relaxation
Decomposition by row blocks or column blocks.
Sequential pseudocode:
  while (maxError > Threshold) {
    Re-apply boundary conditions
    maxError = 0;
    for i = 0 to N-1 {
      for j = 0 to N-1 {
        B[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i,j+1] + A[i+1,j] + A[i-1,j]);
        if (|B[i,j] - A[i,j]| > maxError) maxError = |B[i,j] - A[i,j]|;
      }
    }
    swap B and A
  }
Performance Estimate
Computation: Tcomp = tc·N·N/p, where tc is the computation cost per cell; this holds for either decomposition.
Communication cost Tcomm:
For row decomposition: 2 messages of N words (of 8 bytes each): 2·(α + 8·N·β).
For tile decomposition: 4 messages of N/sqrt(p) words (of 8 bytes each): 4·(α + 8·N·β/sqrt(p)).
Which one is better? Compare for specific values of N, p, α, and β.
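A small sketch of such a comparison, plugging the two formulas above into C; every parameter value below is an illustrative placeholder, not a measurement:

  /* Compare row vs. tile decomposition under the alpha-beta model. */
  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      const double alpha = 10e-6;  /* per-message cost, e.g. ~10 us on a cluster */
      const double beta  = 1e-9;   /* per-byte cost, e.g. ~1 ns/byte */
      const double N = 16384.0, p = 64.0;

      double t_row  = 2.0 * (alpha + 8.0 * N * beta);            /* 2 messages of N words */
      double t_tile = 4.0 * (alpha + 8.0 * N * beta / sqrt(p));  /* 4 messages of N/sqrt(p) words */
      printf("row: %.1f us   tile: %.1f us\n", t_row * 1e6, t_tile * 1e6);
      return 0;
  }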
Gauss-Seidel Relaxation
No separate old/new arrays.
Sequential pseudocode:
  while (maxError > Threshold) {
    Re-apply boundary conditions
    maxError = 0;
    for i = 0 to N-1 {
      for j = 0 to N-1 {
        old = A[i,j];
        A[i,j] = 0.2 * (A[i,j] + A[i,j-1] + A[i,j+1] + A[i+1,j] + A[i-1,j]);
        if (|A[i,j] - old| > maxError) maxError = |A[i,j] - old|;
      }
    }
  }
Sequentially, how well does this work? It works much better! How do we parallelize this?
How Do We Parallelize Gauss-Seidel?
Visualize the flow of values, not the control flow (that goes row by row): which values depend on which values?
Does that give us a clue on how to parallelize?
Parallelizing Gauss-Seidel: some ideas
Row decomposition with pipelining (across PE0, PE1, PE2, ..., PEp-1).
Square over-decomposition: assign many squares to a processor (essentially the same idea?).
[Figure: row decomposition with pipelining. Each of the P processors owns N/P rows, split into column blocks of width W, giving N/W blocks per processor. Work proceeds in diagonal wavefront phases; the total number of phases is N/W + (P - 1).]
[Figure: number of processors in use versus time for row decomposition with pipelining. Usage ramps up from 0 to P over the first P phases, stays at P through the plateau, and ramps back down to 0, finishing after N/W + P - 1 phases.]
Apply the cost model to Gauss-Seidel
Apply the cost model to Gauss-Seidel, and then find the optimal W.
Completion time is the sum of 3 phases: a rising phase, a plateau, and a falling phase.
Communication in each phase is 1 message of W words: α + 8·W·β.
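A sketch of the optimization, under the assumption (consistent with the slide, but mine) that each phase costs tc·W·N/P of computation plus one message of W eight-byte words:

  T(W) = \left(\frac{N}{W} + P - 1\right)\left(t_c \frac{N}{P} W + \alpha + 8W\beta\right),
  \qquad
  \frac{dT}{dW} = 0 \;\Rightarrow\; W^{*} \approx \sqrt{\frac{\alpha N}{(P-1)\left(t_c \frac{N}{P} + 8\beta\right)}}

A small W shortens the pipeline fill and drain but pays α more often; a large W does the opposite, and W* balances the two.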
MPI Performance Analysis and Tools: PMPI, mpiP
Based on slides by Martin Schulz
MPI Profiling
The PMPI profiling interface is provided by the MPI standard.
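A hedged sketch of how a tool uses PMPI: it supplies its own MPI_Send, records what it wants, and forwards to the name-shifted PMPI_Send entry point that the standard requires implementations to provide (the counter below is a hypothetical tool detail):

  #include <mpi.h>

  static long send_count = 0;   /* hypothetical statistic kept by the tool */

  int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
               int dest, int tag, MPI_Comm comm)
  {
      send_count++;                                             /* profiling action */
      return PMPI_Send(buf, count, datatype, dest, tag, comm);  /* the real send */
  }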
Many Profilers Available
JumpShot, TAU, HPCToolkit, TotalView, Allinea Performance Reports, PGPROF, STAT, Intel VTune, mpiP.
We focus on mpiP as an example; we recommend using the one available on the system of interest.
mpiP
Software developed at LLNL. Collects only statistical information about MPI routines.
Easy to download and install: github.com/LLNL/mpiP. Available on many systems.
Link it before the MPI library:
  mpicxx -g -o myprog myprog.c -L/usr/local/tools/mpiP/lib -lmpiP -lbfd -liberty -lintl -lm
Slides based on https://computing.llnl.gov/tutorials/bgq/#mpiP
mpiP's output file is divided into 5 sections: Environment Information, MPI Time Per Task, Callsites, Aggregate Times of Top 20 Callsites, and Callsite Statistics.
The MPI Time Per Task section (MPI time as a fraction of total application time) is a high-level indicator of whether communication is an issue or not!
References (mpiP)
Slides based on material by Martin Schulz et al., https://computing.llnl.gov/tutorials/bgq/#mpiP
Also see https://github.com/LLNL/mpiP and http://mpip.sourceforge.net/
Hybrid Programming MPI and OpenMP
MPI, OpenMP and Pthreads
MPI describes parallelism between processes (with separate address spaces).
Thread parallelism provides a shared-memory model within a process; OpenMP and Pthreads are common models.
OpenMP provides convenient features for loop-level parallelism: threads are created and managed by the compiler, based on user directives.
Pthreads provides a more complex and dynamic approach: threads are created and managed explicitly by the user.
(Balaji, Gropp, Hoefler, Thakur, 2018) How to program on modern systems?
Today's clusters often comprise multiple CPUs per node sharing memory, and the nodes themselves are connected by a network.
(Balaji, Gropp, Hoefler, Thakur, 2018) Hybridization and its benefits
Hybridization is the use of inherently different programming models in a complementary manner, in order to achieve some benefit not possible otherwise; a way to combine models of parallelization that takes advantage of the good points of each.
It may help in both directions: introducing MPI into OpenMP applications can help them scale across multiple SMP nodes, while introducing OpenMP into MPI applications can make more efficient use of the shared memory on SMP nodes, thus mitigating the need for explicit intra-node communication.
Basic Hybrid Stub
  #include <omp.h>
  #include "mpi.h"
  #define _NUM_THREADS 4

  int main(int argc, char *argv[]) {
      int p, my_rank, c = 0;
      omp_set_num_threads(_NUM_THREADS);
      MPI_Init(&argc, &argv);
      MPI_Comm_size(MPI_COMM_WORLD, &p);
      MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
      /* each thread contributes its private c; the reduction sums them */
      #pragma omp parallel reduction(+:c)
      {
          c = omp_get_num_threads();
      }
      MPI_Finalize();
      return 0;
  }
Compiling and running
Compilation: mpicc -openmp test.c -o test (the OpenMP flag is compiler-dependent, e.g. -fopenmp for GCC)
Running:
  export OMP_NUM_THREADS=8
  mpirun -np 4 ./test
(Balaji, Gropp, Hoefler, Thakur, 2018) MPI + OpenMP
MPI defines an alternative to MPI_Init: MPI_Init_thread(required, provided).
The application indicates what level of thread support it needs (required), and MPI returns the level of thread support it provides (provided).
MPI defines four levels of thread safety:
1. MPI_THREAD_SINGLE: there is no OpenMP multithreading in the program.
2. MPI_THREAD_FUNNELED: all MPI calls are made by the master thread, i.e., all MPI calls are outside OpenMP parallel regions, inside OpenMP master regions, or guarded by a call to MPI_Is_thread_main.
3. MPI_THREAD_SERIALIZED: only one thread makes MPI calls at a time, e.g.
   #pragma omp critical
   { MPI calls allowed here }
4. MPI_THREAD_MULTIPLE: any thread may make an MPI call at any time.
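A minimal sketch of requesting a thread level and checking what the implementation actually provides (the full signature also takes argc/argv, as shown):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int provided;
      /* Ask for full multithreading; MPI reports what it actually supports. */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
      if (provided < MPI_THREAD_MULTIPLE)
          fprintf(stderr, "MPI provides only thread level %d\n", provided);
      /* ... hybrid MPI + OpenMP work ... */
      MPI_Finalize();
      return 0;
  }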
(Balaji, Gropp, Hoefler, Thakur, 2018) Specification of MPI_THREAD_MULTIPLE
When multiple threads make MPI calls concurrently, the outcome will be as if the calls executed sequentially in some (any) order.
Blocking MPI calls will block only the calling thread and will not prevent other threads from running or executing MPI functions.
It is the user's responsibility to prevent races when threads in the same application post conflicting MPI calls, e.g., accessing an info object from one thread and freeing it from another thread.
The user must ensure that collective operations on the same communicator, window, or file handle are correctly ordered among threads, e.g., you cannot call a broadcast on one thread and a reduce on another thread on the same communicator.
(Balaji, Gropp, Hoefler, Thakur, 2018) Threads and MPI
An MPI implementation is not required to support levels higher than MPI_THREAD_SINGLE; that is, it is not required to be thread safe.
A fully thread-safe implementation will support MPI_THREAD_MULTIPLE.
A program that calls MPI_Init (instead of MPI_Init_thread) should assume that only MPI_THREAD_SINGLE is supported.
A threaded MPI program that does not call MPI_Init_thread is an incorrect program (a common user error).
(Balaji, Gropp, Hoefler, Thakur, 2018) An Incorrect Program: what is wrong here?
Process 0: Thread 1: MPI_Bcast(comm)    Thread 2: MPI_Barrier(comm)
Process 1: Thread 1: MPI_Bcast(comm)    Thread 2: MPI_Barrier(comm)
Here the user must use some kind of synchronization to ensure that either thread 1 or thread 2 gets scheduled first on both processes.
Otherwise a broadcast may get matched with a barrier on the same communicator, which is not allowed in MPI.
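A minimal sketch (mine, not the tutorial's) of one way to impose that ordering: separate the two collectives with a thread-level barrier so every process issues the broadcast before the barrier; ordered_collectives is a hypothetical routine, and at least MPI_THREAD_SERIALIZED is assumed since two different threads make the calls:

  #include <mpi.h>
  #include <omp.h>

  void ordered_collectives(MPI_Comm comm)
  {
      int value = 0;
      #pragma omp parallel num_threads(2)
      {
          if (omp_get_thread_num() == 0)
              MPI_Bcast(&value, 1, MPI_INT, 0, comm);  /* thread 0: broadcast first */
          #pragma omp barrier   /* the Bcast call has returned on this process */
          if (omp_get_thread_num() == 1)
              MPI_Barrier(comm);                       /* thread 1: barrier second */
      }
  }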
(Balaji, Gropp, Hoefler, Thakur, 2018) A Correct Example: why is this right?
Process 0: Thread 1: MPI_Recv(src=1)    Thread 2: MPI_Send(dst=1)
Process 1: Thread 1: MPI_Recv(src=0)    Thread 2: MPI_Send(dst=0)
The MPI implementation must ensure that the above example never deadlocks for any ordering of thread execution.
That means the implementation cannot simply acquire a thread lock and block within an MPI function; it must release the lock to allow other threads to make progress.
(Balaji, Gropp, Hoefler, Thakur, 2018) Performance with MPI_THREAD_MULTIPLE
All MPI implementations support MPI_THREAD_SINGLE.
They probably support MPI_THREAD_FUNNELED even if they don't admit it; it does require thread-safe malloc, which is probably OK in OpenMP programs.
Many (but not all) implementations support MPI_THREAD_MULTIPLE; it is hard to implement efficiently, though (a lock-granularity issue).
Thread safety does not come for free: the implementation must protect certain data structures, or parts of the code, with mutexes or critical sections.
(Balaji, Gropp, Hoefler, Thakur, 2018) Process Memory Requirements: MPI vs. MPI + OpenMP
Separate processes need separate address and memory space, whereas OpenMP threads have much lighter memory requirements.
Only one copy of the MPI buffers etc. exists per process, and it is shared between all threads launched from that process.
Using MPI/OpenMP hybrid programming therefore reduces the memory overhead that comes from running many processes.
(Balaji, Gropp, Hoefler, Thakur, 2018) Halo Regions
Halo regions are local copies of remote data that are needed for computations (remember the Jacobi example?).
Using OpenMP parallelism reduces the size of the halo-region copies that need to be stored, and reducing halo-region sizes also reduces communication requirements.
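A rough back-of-the-envelope version of this argument (my own illustration, not from the slides), for an N×N grid decomposed into square tiles with a one-cell halo per edge:

  \text{total halo cells} \approx p \cdot \frac{4N}{\sqrt{p}} = 4N\sqrt{p}
  \;\longrightarrow\; 4N\sqrt{p/t} \text{ when } p \text{ ranks become } p/t \text{ ranks of } t \text{ threads each}

so hybridizing shrinks aggregate halo storage and halo traffic by roughly a factor of sqrt(t).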
(Balaji, Gropp, Hoefler, Thakur, 2018) Overlapping Communication with Computation
While some threads take care of communication, other threads can get the computation work done.
Earlier, we were trying to achieve this using non-blocking operations and careful overlap.
Hybrid Programming: Benefits and Pitfalls
The natural use is to have the master thread do the communication and to parallelize the computational loops using OpenMP or Pthreads (a sketch of this pattern follows below).
The problem is that this decreases the computation time significantly but keeps the communication time the same, so overall efficiency suffers.
The alternative is to attempt to overlap computation with communication. Complexity increases because of correctness issues and performance gotchas (unexpected performance losses), e.g. the locking costs of shared data structures.
But hybrid programming is necessary: MPI-everywhere (i.e., a rank on each hardware thread or core) leaves too much capability unexploited, e.g. the ability to avoid copies of common (shared, read-only or read-mostly) data.
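A minimal sketch (mine, not the slides') of the master-communicates/others-compute pattern for a 1-D row-decomposed stencil; hybrid_step, the row layout (rows 1..n owned, rows 0 and n+1 as halo), and the neighbor ranks up/down are illustrative assumptions, and MPI_THREAD_FUNNELED is assumed:

  #include <mpi.h>
  #include <omp.h>

  void hybrid_step(double *u, int n, int up, int down, MPI_Comm comm)
  {
      MPI_Request req[4];
      #pragma omp parallel
      {
          #pragma omp master
          {   /* only the master thread touches MPI */
              MPI_Irecv(&u[0],           n, MPI_DOUBLE, up,   0, comm, &req[0]);
              MPI_Irecv(&u[(n + 1) * n], n, MPI_DOUBLE, down, 1, comm, &req[1]);
              MPI_Isend(&u[1 * n],       n, MPI_DOUBLE, up,   1, comm, &req[2]);
              MPI_Isend(&u[n * n],       n, MPI_DOUBLE, down, 0, comm, &req[3]);
              MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
          }
          /* Meanwhile, the other threads work on rows that need no halo data. */
          #pragma omp for nowait
          for (int i = 2; i < n; i++) {
              /* ... update interior row i (stencil update elided) ... */
          }
          #pragma omp barrier   /* halo rows must have arrived before this point */
          /* ... update boundary rows 1 and n, which depend on the halo ... */
      }
  }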
Reference
Pavan Balaji, William Gropp, Torsten Hoefler, and Rajeev Thakur, 2018. Retrieved from: https://anl.app.box.com/v/balaji-tutorials-2018/