
Optimizing Loop Scheduling in OpenMP for Parallel Execution
"Learn about loop scheduling in OpenMP for efficient parallel execution, including different scheduling options like static, dynamic, guided, auto, and runtime. Understand how to divide loop iterations among threads effectively to enhance performance."
Presentation Transcript
Worksharing OpenMP parallel for loops: Loop Scheduling
OpenMP loops: A primer
- OpenMP provides a loop construct that specifies how loop iterations are executed in parallel by the threads in the team, in the context of their implicit tasks, i.e., how the loop iterations are divided among the threads.
- The schedule clause affects how loop iterations are mapped onto threads:
      #pragma omp parallel for schedule(kind [, chunk_size])
- There are five kinds of schedules for an OpenMP loop: static, dynamic, guided, auto, and runtime.
- The optional chunk parameter, when specified, must be a positive integer.
- The OpenMP implementation and its runtime define how loop chunks are assigned to the threads of a team, given the kind of schedule specified as a hint.
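For concreteness, here is a minimal sketch of this syntax (the array name data, the loop body, and the chunk size 4 are illustrative placeholders, not part of the original slides); it can be compiled with, e.g., gcc -fopenmp:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        enum { N = 16 };
        double data[N];

        /* Assign chunks of 4 consecutive iterations to each thread, statically. */
        #pragma omp parallel for schedule(static, 4)
        for (int i = 0; i < N; ++i) {
            data[i] = i * i;  /* placeholder work */
            printf("iteration %2d handled by thread %d\n", i, omp_get_thread_num());
        }
        return 0;
    }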
OpenMP For Construct: The schedule clause

#pragma omp parallel for schedule(static [, chunk_size])
- Divides the loop into equal-sized chunks, or chunks as equal as possible when the number of loop iterations is not evenly divisible by the number of threads multiplied by the chunk size.
- By default, chunk_size = loop_count / number_of_threads.
- Set the chunk size to 1 to interleave the iterations.

#pragma omp parallel for schedule(dynamic [, chunk_size])
- Uses an internal work queue to give a chunk-sized block of loop iterations to each thread. When a thread finishes, it retrieves the next chunk-sized block of loop iterations from the top of the work queue.
- By default, the chunk size is 1.
- Be careful when using this scheduling type because of the extra overhead involved.
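A sketch of the dynamic work-queue behavior described above; the function uneven_work and the use of usleep to simulate iterations of varying cost are assumptions for illustration only:

    #include <omp.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Simulate iterations whose cost grows with i (placeholder for real work). */
    static void uneven_work(int i) { usleep(1000 * (i + 1)); }

    int main(void) {
        /* Each idle thread grabs the next block of 2 iterations from the work queue,
           so fast threads end up processing more iterations than slow ones. */
        #pragma omp parallel for schedule(dynamic, 2)
        for (int i = 0; i < 16; ++i) {
            uneven_work(i);
            printf("iteration %2d done by thread %d\n", i, omp_get_thread_num());
        }
        return 0;
    }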
OpenMP For Construct: The schedule clause

#pragma omp parallel for schedule(guided [, chunk_size])
- Similar to dynamic scheduling, but the chunk size starts off large and decreases, to better handle load imbalance between iterations.
- The optional chunk parameter specifies the minimum chunk size to use.
- The difference from the dynamic scheduling type is in the size of the chunks: the size of a chunk is proportional to the number of unassigned iterations divided by the number of threads.

#pragma omp parallel for schedule(auto)
- When schedule(auto) is specified, the scheduling decision is delegated to the compiler and runtime (factors involved: platform, architecture, and compiler implementation).
- The programmer gives the implementation the freedom to choose any possible mapping of iterations to threads in the team.

#pragma omp parallel for schedule(runtime)
- Uses the OMP_SCHEDULE environment variable to specify the scheduling type; OpenMP determines the schedule from the corresponding internal control variable.
- E.g. $ export OMP_SCHEDULE=scheduling-type
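As a sketch of the runtime kind, the schedule can be taken either from the OMP_SCHEDULE environment variable or set from code with omp_set_schedule(); the loop body here is a placeholder:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        /* Alternative to the environment variable: set the schedule ICV in code,
           roughly equivalent to  export OMP_SCHEDULE="dynamic,1"  before running. */
        omp_set_schedule(omp_sched_dynamic, 1);

        /* The actual schedule is decided at run time from the ICV set above
           (or from OMP_SCHEDULE if omp_set_schedule is not called). */
        #pragma omp parallel for schedule(runtime)
        for (int i = 0; i < 16; ++i) {
            printf("iteration %2d on thread %d\n", i, omp_get_thread_num());
        }
        return 0;
    }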
OpenMP parallel for loops: scheduling (examples)
- If each iteration does roughly the same amount of work, then the default behavior of OpenMP is usually good. E.g., with 4 threads and 40 iterations, the first thread takes care of iterations 0-9, the second thread takes care of iterations 10-19, etc. This is good especially if we are doing memory lookups in the loop; each thread then reads memory linearly.
- However, if the amount of work varies across iterations, other considerations come into play. E.g., suppose your code calls a function w(i) that takes time proportional to the value of i. With a plain parallel for loop, thread 0 will process all the small jobs and thread 3 all the large jobs, so we will wait a long time until the final thread finishes.
OpenMP parallel for loops: scheduling

    omp_set_num_threads(4);
    a();
    #pragma omp parallel for
    for (int i = 0; i < 16; ++i) {
        w(i);
    }
    z();

We can, however, do much better if we ask OpenMP to assign iterations to threads in a cyclic order: iteration 0 to thread 0, iteration 1 to thread 1, etc. Adding schedule(static,1) does the trick:

    omp_set_num_threads(4);
    a();
    #pragma omp parallel for schedule(static,1)
    for (int i = 0; i < 16; ++i) {
        w(i);
    }
    z();
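The snippets above assume surrounding code; a self-contained version might look as follows, where a(), w(i), and z() are replaced by simple placeholder functions for illustration:

    #include <omp.h>
    #include <stdio.h>

    /* Placeholder stand-ins for the slides' a(), w(i), and z(). */
    static void a(void) { printf("setup\n"); }
    static void w(int i) { printf("w(%2d) on thread %d\n", i, omp_get_thread_num()); }
    static void z(void) { printf("teardown\n"); }

    int main(void) {
        omp_set_num_threads(4);
        a();
        /* Cyclic assignment: iteration i goes to thread i % 4. */
        #pragma omp parallel for schedule(static, 1)
        for (int i = 0; i < 16; ++i) {
            w(i);
        }
        z();
        return 0;
    }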
OpenMP parallel for loops: static scheduling
- A chunk size of 1 indicates that we assign one iteration to each thread before switching to the next thread.
- If we wanted to process the iterations in chunks of size 2, we could use schedule(static,2), but in this case it is not useful:

    omp_set_num_threads(4);
    a();
    #pragma omp parallel for schedule(static,2)
    for (int i = 0; i < 16; ++i) {
        w(i);
    }
    z();
OpenMP parallel for loops: Dynamic Scheduling
Sometimes, however, our workload is tricky: maybe some iterations take much longer than others, and with any fixed schedule we might be unlucky and some threads will run much longer:

    a();
    #pragma omp parallel for
    for (int i = 0; i < 16; ++i) {
        v(i);
    }
    z();

In this case we can resort to dynamic scheduling.
OpenMP parallel for loops: Dynamic scheduling

    a();
    #pragma omp parallel for schedule(dynamic,1)
    for (int i = 0; i < 16; ++i) {
        v(i);
    }
    z();

1. First, each thread takes one iteration and processes it.
2. Then, it checks for the next available iteration that is not yet being processed.

It is good to note that dynamic scheduling does not necessarily give an optimal schedule.
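If the per-chunk overhead of dynamic scheduling becomes noticeable, the guided kind described earlier is one alternative worth trying. The sketch below reuses v(i) as a placeholder for uneven work; its definition (and the use of usleep) is an assumption for illustration only:

    #include <omp.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Placeholder for the slides' v(i): iterations with unpredictable cost. */
    static void v(int i) { usleep(500 * ((i * 7) % 5 + 1)); }

    static void a(void) { printf("setup\n"); }
    static void z(void) { printf("teardown\n"); }

    int main(void) {
        a();
        /* Guided: chunks start large and shrink, keeping scheduling overhead lower
           than dynamic,1 while still balancing the load near the end of the loop. */
        #pragma omp parallel for schedule(guided)
        for (int i = 0; i < 16; ++i) {
            v(i);
        }
        z();
        return 0;
    }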