Understanding Parallel Programming with OpenMP: Work Complexity and Prefix Sum Algorithms


Dive into the realm of parallel programming with OpenMP as we explore the concepts of work complexity, reduction, and prefix sum operations. Learn how to optimize algorithms for parallel execution and understand the intricate details of scan operations. Discover the uses of scan in various applications such as sorting, string comparison, and building data structures.

  • Parallel Programming
  • OpenMP
  • Work Complexity
  • Prefix Sum
  • Algorithms




Presentation Transcript


  1. Parallel Programming: OpenMP, Scan, Work Complexity, and Step Complexity
     David Monismith, CS599
     Based upon notes from GPU Gems 3, Chapter 39:
     http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html

  2. Work Complexity
     Work complexity asks, in terms of the input size, how many operations (in Big-Oh order) an algorithm requires. Example: summation. Given an array of n elements, it takes O(n) operations to find the sum of those elements on a single-core processor. We want parallel algorithms to be work efficient, that is, to take the same order of work (operations) as the serial version. This means that a parallelized version of sum should take O(n) operations (additions) to finish.
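     For concreteness, a minimal serial sum in C (not from the slides) that performs exactly the O(n) additions described above:

     #include <stdio.h>

     #define N 6

     int main(void) {
         int arr[N] = {7, 2, 3, 1, 9, 2};
         int sum = 0;
         int i;
         for (i = 0; i < N; i++)    /* one addition per element: O(n) work */
             sum += arr[i];
         printf("sum = %d\n", sum); /* prints 24 */
         return 0;
     }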

  3. Reduction Work Complexity
     Recall that we have studied reduction. In class we have seen that a sum reduction takes O(n) work to run to completion, the same order of work (operations) as the serial version. Therefore, we say a sum reduction is work-efficient. A minimal OpenMP version is sketched below.
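     The slide shows no code, so the following is a minimal sketch of an OpenMP sum reduction. The reduction(+:sum) clause gives each thread a private partial sum and combines the partials at the end, keeping the total at O(n) additions (compile with -fopenmp):

     #include <stdio.h>

     #define N 6

     int main(void) {
         int arr[N] = {7, 2, 3, 1, 9, 2};
         int sum = 0;
         int i;
         #pragma omp parallel for reduction(+:sum)
         for (i = 0; i < N; i++)
             sum += arr[i];         /* each thread accumulates privately */
         printf("sum = %d\n", sum); /* prints 24 */
         return 0;
     }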

  4. Prefix Sum (Scan)
     The prefix sum, or scan, operation is useful for computing cumulative sums. Example: given the following array, compute the prefix sum.

     +---+---+---+---+---+---+
     | 7 | 2 | 3 | 1 | 9 | 2 |
     +---+---+---+---+---+---+
       0   1   2   3   4   5

  5. Uses for Scan
     • Sorting
     • Lexical analysis
     • String comparison
     • Polynomial evaluation
     • Stream compaction (sketched below)
     • Building histograms and data structures
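     To make one of these concrete, here is a hedged sketch of stream compaction built on an exclusive scan; the keep-values-greater-than-2 predicate is purely illustrative:

     #define N 6

     int arr[N]  = {7, 2, 3, 1, 9, 2};
     int flag[N], idx[N], out[N];
     int i;

     for (i = 0; i < N; i++)        /* flag = 1 means keep this element */
         flag[i] = (arr[i] > 2);

     idx[0] = 0;                    /* an exclusive scan of the flags gives */
     for (i = 1; i < N; i++)        /* each survivor's output position      */
         idx[i] = idx[i-1] + flag[i-1];

     for (i = 0; i < N; i++)        /* scatter the survivors */
         if (flag[i])
             out[idx[i]] = arr[i];
     /* out = {7, 3, 9}; compacted length = idx[N-1] + flag[N-1] = 3 */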

  6. Exclusive Prefix Sum (Exclusive Scan)
     Example: given the following array, compute the exclusive prefix sum. Each output element holds the sum of all inputs before its position; the first element is the identity, 0.

     Input:
     +---+---+---+---+---+---+
     | 7 | 2 | 3 | 1 | 9 | 2 |
     +---+---+---+---+---+---+
       0   1   2   3   4   5

     Output:
     +---+---+---+---+---+---+
     | 0 | 7 | 9 | 12| 13| 22|
     +---+---+---+---+---+---+
       0   1   2   3   4   5

  7. Inclusive Prefix Sum (Inclusive Scan)
     Example: given the following array, compute the inclusive prefix sum. Each output element holds the sum of all inputs up to and including its position.

     Input:
     +---+---+---+---+---+---+
     | 7 | 2 | 3 | 1 | 9 | 2 |
     +---+---+---+---+---+---+
       0   1   2   3   4   5

     Output:
     +---+---+---+---+---+---+
     | 7 | 9 | 12| 13| 22| 24|
     +---+---+---+---+---+---+
       0   1   2   3   4   5

  8. Naïve Prefix Sum (Exclusive Scan) Algorithm

     #define N 6

     int arr[N] = {7, 2, 3, 1, 9, 2};
     int prefixSum[N];
     int i;

     prefixSum[0] = 0;                         /* the exclusive scan starts with the identity, 0 */
     for (i = 1; i < N; i++)                   /* each output adds the previous input element    */
         prefixSum[i] = prefixSum[i-1] + arr[i-1];
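     For the example array this produces prefixSum = {0, 7, 9, 12, 13, 22}, matching the exclusive scan output on slide 6.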

  9. Naïve Parallel Inclusive Scan (Hillis and Steele 1986)

     Step 0 (input):
     +---+---+---+---+---+---+
     | 7 | 2 | 3 | 1 | 9 | 2 |
     +---+---+---+---+---+---+
       0   1   2   3   4   5

     Step 1: each element j >= 1 adds the element 1 position to its left.
     +---+---+---+---+---+---+
     | 7 | 9 | 5 | 4 | 10| 11|
     +---+---+---+---+---+---+
       0   1   2   3   4   5

     Step 2: each element j >= 2 adds the element 2 positions to its left.
     +---+---+---+---+---+---+
     | 7 | 9 | 12| 13| 15| 15|
     +---+---+---+---+---+---+
       0   1   2   3   4   5

  10. Parallel Scan Example Continued

     Step 3: each element j >= 4 adds the element 4 positions to its left.
     +---+---+---+---+---+---+
     | 7 | 9 | 12| 13| 15| 15|
     +---+---+---+---+---+---+
       0   1   2   3   4   5

     +---+---+---+---+---+---+
     | 7 | 9 | 12| 13| 22| 24|
     +---+---+---+---+---+---+
       0   1   2   3   4   5

  11. Naïve Parallel Scan

     #include <string.h>   /* memcpy, memset */
     #include <math.h>     /* log2 */

     #define N 6

     int arr[N] = {7, 2, 3, 1, 9, 2};
     int scanSum[2][N];                            /* ping-pong (in/out) buffers */
     int i, j;
     int in = 1, out = 0;
     int log_of_N = (int) ceil(log2((double) N)); /* number of steps: ceil(lg 6) = 3 */

     memcpy(scanSum[out], arr, N * sizeof(int));  /* buffer 0 starts as the input */
     memset(scanSum[in], 0, N * sizeof(int));
     int two_i = 0, two_i_1 = 0;

  12. Naïve Parallel Scan

     for (i = 1; i <= log_of_N; i++) {
         two_i_1 = 1 << (i-1);                /* stride for this step: 2^(i-1) */
         two_i   = 1 << i;
         out = 1 - out;                       /* swap the ping-pong buffers */
         in  = 1 - out;
         for (j = 0; j < N; j++) {
             if (j >= two_i_1)
                 scanSum[out][j] = scanSum[in][j] + scanSum[in][j - two_i_1];
             else
                 scanSum[out][j] = scanSum[in][j];   /* this element is already final */
         }
     }

  13. Adding OpenMP Parallelism

     for (i = 1; i <= log_of_N; i++) {
         two_i_1 = 1 << (i-1);
         two_i   = 1 << i;
         out = 1 - out;
         in  = 1 - out;
         #pragma omp parallel for private(j) shared(scanSum, in, out)
         for (j = 0; j < N; j++) {
             if (j >= two_i_1)
                 scanSum[out][j] = scanSum[in][j] + scanSum[in][j - two_i_1];
             else
                 scanSum[out][j] = scanSum[in][j];
         }
     }
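     Note that the ping-pong buffers are what make this loop safe to parallelize: every iteration reads only from scanSum[in] and writes only to scanSum[out], so no thread can observe a partially updated value. An in-place update would race, since the thread writing element j reads element j - two_i_1, which another thread may be updating at the same time.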

  14. Naïve Parallel Scan Analysis
     • Not work efficient: it performs O(n lg n) additions, while the serial version performs O(n).
     • Requires O(lg n) steps.
     • This means we can complete a scan in O(lg n) time if we have O(n) processors.

  15. Work Efficient Parallel Scan (Blelloch 1990)
     Goal: build a scan algorithm that avoids the extra O(lg n) factor of work. Solution: make use of balanced binary trees. Build a balanced binary tree on the input data and sweep the tree to compute the prefix sum. The algorithm consists of two parts:
     • Up sweep: traverse the tree from the leaves to the root, computing partial sums during the traversal. The root node (the last element in the array) ends up holding the sum of all elements.
     • Down sweep: traverse from the root to the leaves, using the partial sums to compute the cumulative sums in place. For the exclusive prefix sum, we first replace the root with zero.

  16. Blelloch Up Sweep

     Leaves (input):
     +---+---+---+---+---+---+---+---+
     | 7 | 2 | 3 | 1 | 9 | 2 | 8 | 4 |
     +---+---+---+---+---+---+---+---+
       0   1   2   3   4   5   6   7

     After level 1 (pair sums at stride 2):
     +---+---+---+---+---+---+---+---+
     | 7 | 9 | 3 | 4 | 9 | 11| 8 | 12|
     +---+---+---+---+---+---+---+---+

     After level 2 (stride 4):
     +---+---+---+---+---+---+---+---+
     | 7 | 9 | 3 | 13| 9 | 11| 8 | 23|
     +---+---+---+---+---+---+---+---+

     After level 3 (stride 8); the root (last element) holds the total, 36:
     +---+---+---+---+---+---+---+---+
     | 7 | 9 | 3 | 13| 9 | 11| 8 | 36|
     +---+---+---+---+---+---+---+---+

  17. Blelloch Algorithm Up Sweep

     /* Here scanSum is a one-dimensional array of N elements initialized to
        the input; N must be a power of two and log_of_N_1 = log2(N) - 1. */
     for (i = 0; i <= log_of_N_1; i++) {
         two_i_p1 = 1 << (i+1);               /* 2^(i+1): subtree size at this level */
         two_i    = 1 << i;
         for (j = 0; j < N; j += two_i_p1)    /* each subtree root absorbs its left child */
             scanSum[j + two_i_p1 - 1] = scanSum[j + two_i - 1] + scanSum[j + two_i_p1 - 1];
     }
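     The slides leave the up sweep serial, but within each level the iterations touch disjoint pairs of elements, so a minimal sketch (assuming the same variables as above) could parallelize each level with OpenMP, mirroring slide 13; the levels themselves must still run in order:

     for (i = 0; i <= log_of_N_1; i++) {
         two_i_p1 = 1 << (i+1);
         two_i    = 1 << i;
         /* each iteration updates a disjoint pair, so the level is race-free */
         #pragma omp parallel for private(j) shared(scanSum, two_i, two_i_p1)
         for (j = 0; j < N; j += two_i_p1)
             scanSum[j + two_i_p1 - 1] += scanSum[j + two_i - 1];
     }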

  18. Blelloch Down Sweep

     Start from the up-sweep result and replace the root (last element) with zero:
     +---+---+---+---+---+---+---+---+
     | 7 | 9 | 3 | 13| 9 | 11| 8 | 0 |
     +---+---+---+---+---+---+---+---+

     At each level, every subtree root hands its old value to its left child
     and stores (old left child + old root value) in itself.

     After the stride-8 level:
     +---+---+---+---+---+---+---+---+
     | 7 | 9 | 3 | 0 | 9 | 11| 8 | 13|
     +---+---+---+---+---+---+---+---+

     After the stride-4 level:
     +---+---+---+---+---+---+---+---+
     | 7 | 0 | 3 | 9 | 9 | 13| 8 | 24|
     +---+---+---+---+---+---+---+---+

     After the stride-2 level (the exclusive prefix sum):
     +---+---+---+---+---+---+---+---+
     | 0 | 7 | 9 | 12| 13| 22| 24| 32|
     +---+---+---+---+---+---+---+---+
       0   1   2   3   4   5   6   7

  19. Blelloch Algorithm Down Sweep

     scanSum[N-1] = 0;                        /* clear the root for the exclusive scan */
     for (i = log_of_N_1; i >= 0; i--) {
         two_i_p1 = 1 << (i+1);
         two_i    = 1 << i;
         for (j = 0; j < N; j += two_i_p1) {
             int t = scanSum[j + two_i - 1];                            /* old left child */
             scanSum[j + two_i - 1]    = scanSum[j + two_i_p1 - 1];     /* left child gets the root's value */
             scanSum[j + two_i_p1 - 1] = t + scanSum[j + two_i_p1 - 1]; /* root gets left + root */
         }
     }
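     As with the up sweep, the slides keep the down sweep serial; since each level's iterations again touch disjoint pairs, a hedged OpenMP sketch under the same assumptions looks like:

     scanSum[N-1] = 0;
     for (i = log_of_N_1; i >= 0; i--) {
         two_i_p1 = 1 << (i+1);
         two_i    = 1 << i;
         #pragma omp parallel for private(j) shared(scanSum, two_i, two_i_p1)
         for (j = 0; j < N; j += two_i_p1) {
             int t = scanSum[j + two_i - 1];  /* t is private: declared inside the loop */
             scanSum[j + two_i - 1]    = scanSum[j + two_i_p1 - 1];
             scanSum[j + two_i_p1 - 1] = t + scanSum[j + two_i_p1 - 1];
         }
     }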

  20. Analysis
     This algorithm completes the scan in O(n) operations, making it work-efficient. More precisely, it requires 2*(n-1) additions and n-1 swaps to run to completion. For large arrays, this algorithm will indeed outperform the Hillis and Steele algorithm.
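     The step complexity is O(lg n): lg n up-sweep levels plus lg n down-sweep levels, each of which can run as one parallel step. As a check, the exclusive result converts to the inclusive scan of slide 7 by adding arr[i] to element i.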
