Parallel Algorithms: Theory and Practice — Locality and I/O-Efficient Parallel Algorithms


Explore the concepts of locality and I/O-efficient parallel algorithms, including the two-level memory hierarchy, I/O model complexities, I/O-efficient algorithms, multi-way merge sorting in the EM model, and tiling MatMul techniques for optimization.

  • Parallel Algorithms
  • I/O-Efficient
  • Memory Hierarchy
  • Complexity
  • Multi-Way Merge
  • MatMul Optimization

Presentation Transcript


  1. Parallel Algorithms: Theory and Practice. CS260 Lecture 11. Yan Gu. Locality and I/O-Efficient Parallel Algorithms


  3. Last week - The I/O model. Two-level memory hierarchy: a small memory (fast memory, cache) of fixed size M, and a large memory (slow memory) of unbounded size. Both are partitioned into blocks of size B. Instructions can only apply to data in fast memory, and are free.

  4. Last week - The I/O model. The I/O model has two special memory-transfer instructions. Read transfer: load a block from slow memory. Write transfer: write a block to slow memory. The complexity of an algorithm on the I/O model (I/O complexity) is measured by #(read transfers) + #(write transfers).
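As a concrete instance of counting memory transfers, a sequential scan of N contiguous, block-aligned elements costs ceil(N/B) read transfers. A minimal sketch (the function name is my own, not from the slides):

```python
import math

def scan_io_cost(n: int, b: int) -> int:
    # Read transfers needed to scan n contiguous elements stored in
    # size-b blocks, assuming the array starts on a block boundary.
    return math.ceil(n / b)
```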

  5. Last week - I/O-efficient algorithms

     Problem                 RAM algorithm   I/O algorithm
     Sorting                 O(N log N)      O((N/B) log_{M/B}(N/B))
     Matrix transpose        O(N^2)          O(N^2/B)
     Matrix multiplication   O(N^3)          O(N^3/(B sqrt(M)))
     Search                  O(log_2 N)      O(log_B N)

  6. Last week - Multi-way mergesort on the EM model. Recursively merge M/B sorted lists at a time, in log_{M/B}(N/B) rounds, for a total cost of O((N/B) log_{M/B}(N/B)).
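The merge step above can be sketched as follows; this is an illustrative Python sketch (function names are my own), where the fan-in plays the role of M/B:

```python
import heapq
import math

def multiway_merge(runs):
    # One round of multi-way merging: combine up to M/B sorted runs
    # into a single sorted run using a min-heap.
    return list(heapq.merge(*runs))

def merge_rounds(n, m, b):
    # Number of rounds to reduce N/B initial size-B runs to one with
    # fan-in M/B: log_{M/B}(N/B), rounded up.
    return math.ceil(math.log(n / b, m / b))
```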

  7. Last week - Tiling MatMul. Z := X * Y:

     for i = 1 to N/s do
       for j = 1 to N/s do
         for k = 1 to N/s do
           for i' = (i-1)*s+1 to i*s do
             for j' = (j-1)*s+1 to j*s do
               for k' = (k-1)*s+1 to k*s do
                 Z[i'][j'] += X[i'][k'] * Y[k'][j']

     Plug in s = Theta(sqrt(M)): the number of subMMs is (N/s)^3 = O((N/sqrt(M))^3), and the I/O per subMM is O(s^2/B) = O(M/B), so the overall I/O cost is O(N^3/(B sqrt(M))).
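The tiled loops above translate directly into runnable code. A hedged Python sketch (pure lists, no NumPy; tile size s stands in for Theta(sqrt(M))):

```python
def tiled_matmul(X, Y, s):
    # Tiled matrix multiplication: process s-by-s subproblems so that,
    # with s = Theta(sqrt(M)), each trio of tiles fits in fast memory.
    n = len(X)
    Z = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                for i in range(i0, min(i0 + s, n)):
                    for j in range(j0, min(j0 + s, n)):
                        acc = Z[i][j]
                        for k in range(k0, min(k0 + s, n)):
                            acc += X[i][k] * Y[k][j]
                        Z[i][j] = acc
    return Z
```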

  8. Last week - B-tree/(2,3)-tree. Objects are stored in leaves. Leaves all have the same depth. The root has at most B children; other internal nodes have between B/2 and B children. (Figure: example B-tree with root keys 18 and 35 over leaf keys 1 through 53.)

  9. Last week - B-tree search. Compare the search key against the partitioning keys and move to the proper child. Cost is O(height * node size) comparisons, i.e., O(height) = O(log_B N) block transfers. (Figure: searching for 21 in the example B-tree.)
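The descent described above can be sketched in a few lines. This is an illustrative sketch under the slide's conventions (objects in leaves, partitioning keys in internal nodes, keys equal to a partition key going right); the `Node` class and function name are my own:

```python
import bisect

class Node:
    # keys: sorted partitioning keys (or object keys, in a leaf);
    # children: list of child Nodes, or None for a leaf.
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children

def btree_search(root, key):
    # Each node visited costs one block read, so the I/O cost is
    # O(height) = O(log_B N).
    node = root
    while node.children is not None:
        node = node.children[bisect.bisect_right(node.keys, key)]
    return key in node.keys
```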

  10. Last week - I/O-efficient algorithms

     Problem                 RAM algorithm   I/O algorithm
     Sorting                 O(N log N)      O((N/B) log_{M/B}(N/B))
     Matrix transpose        O(N^2)          O(N^2/B)
     Matrix multiplication   O(N^3)          O(N^3/(B sqrt(M)))
     Search                  O(log_2 N)      O(log_B N)

  11. Problems of the I/O model and I/O algorithms. How about parallelism? What about a multiple-level memory hierarchy? The algorithm is machine-specific: you need to rewrite code when running on another machine. What about when M and B are not static?

  12. Parallel Algorithms: Theory and Practice. CS260 Lecture 12. Yan Gu. Cache-Oblivious Algorithms

  13. The I/O model. Abstracts a single level of the memory hierarchy: a fast memory (cache) of size M; accessing fast memory is free, but moving data from slow memory is expensive; memory is grouped into size-B blocks of contiguous data. Cost: the number of block transfers (or I/Os) from slow memory to fast memory.

  14. Cache-Oblivious Algorithms. Algorithms not parameterized by M or B: these algorithms are unaware of the parameters of the memory hierarchy. Analyze in the ideal-cache model: the same as the I/O model, except that optimal replacement is assumed. Question: how do I analyze based on the ideal cache-replacement policy?

  15. Cache-Oblivious Algorithms. Algorithms not parameterized by M or B: these algorithms are unaware of the parameters of the memory hierarchy. Analyze in the ideal-cache model: the same as the I/O model, except that optimal replacement is assumed. Use a specific cache sequence to analyze the I/O cost: an ideal cache will do no worse than this specific load/evict sequence. In practice, real caches based on the LRU policy perform similarly to the optimal cache.

  16. Cache-oblivious algorithms

     Problem                 RAM algorithm   CO algorithm
     Sorting                 O(N log N)      O((N/B) log_{M/B}(N/B))
     Matrix transpose        O(N^2)          O(N^2/B)
     Matrix multiplication   O(N^3)          O(N^3/(B sqrt(M)))
     Search                  O(log_2 N)      O(log_B N)

  17. Matrix transpose. The simplest implementation is not I/O-efficient, assuming the matrix is stored in row-major order:

     for i = 1 to N do
       for j = i+1 to N do
         swap(A[i][j], A[j][i])

     If N > M, there is no locality across iterations: Theta(N^2) I/Os.

  18. Matrix transpose - I/O algorithm. The simplest implementation is not I/O-efficient; the I/O algorithm, which swaps mirrored s-by-s tiles, has a cost of O(N^2/B).
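The tiled transpose can be sketched as follows; this is an illustrative Python sketch (my own naming), where the tile size s stands in for Theta(sqrt(M)):

```python
def blocked_transpose(A, s):
    # In-place transpose by s-by-s tiles. With s = Theta(sqrt(M)), each
    # pair of mirrored tiles fits in fast memory together, giving
    # O(N^2/B) I/Os instead of Theta(N^2) for the element-by-element loop.
    n = len(A)
    for i0 in range(0, n, s):
        for j0 in range(i0, n, s):          # only tiles on or above the diagonal
            for i in range(i0, min(i0 + s, n)):
                for j in range(max(j0, i + 1), min(j0 + s, n)):
                    A[i][j], A[j][i] = A[j][i], A[i][j]
    return A
```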

  19. Matrix transpose -cache-oblivious algorithm

  20. Matrix transpose -cache-oblivious algorithm

  21. Array storage. How many blocks does a size-N array occupy? If it is aligned on a block boundary (usually true for cache-aware algorithms), it takes exactly ceil(N/B) blocks. If you are unlucky, it is ceil(N/B) + 1 blocks, and this is generally what you need to assume for cache-oblivious algorithms, since you cannot force alignment. So a size-N array occupies at most ceil(N/B) + 1 = O(1 + N/B) blocks.
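The alignment argument above is easy to check numerically. A small sketch (my own helper name) counting the blocks overlapped by a run of elements at an arbitrary offset:

```python
def blocks_touched(start, n, b):
    # Number of size-b blocks overlapped by the n elements stored at
    # positions [start, start + n) of slow memory.
    return (start + n - 1) // b - start // b + 1
```

Sweeping `start` over all offsets within a block confirms the worst case is one more block than the aligned case.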

  22. Matrix transpose - cache-oblivious algorithm. (Figure: the row-major layout of the matrix in slow memory.) A size-N array occupies at most ceil(N/B) + 1 = O(1 + N/B) blocks, since you cannot control cache alignment.

  23. Matrix transpose - cache-oblivious algorithm. (Figure: an s-by-s submatrix within the row-major layout.) The total number of blocks needed to store an s-by-s submatrix is s * (ceil(s/B) + 1) = O(s + s^2/B) blocks.

  24. Matrix transpose - cache-oblivious algorithm. The total number of blocks needed to store an s-by-s submatrix is s * (ceil(s/B) + 1) = O(s + s^2/B) blocks. Want: s * (ceil(s/B) + 1) <= M/B, so the submatrix fits in cache. If we assume M >= B^2, then s = c * sqrt(M) works for some constant c. This is called the Tall-Cache Assumption.

  25. Matrix transpose - cache-oblivious algorithm. The simplest implementation is not I/O-efficient. The I/O algorithm has a cost of O(N^2/B). The cache-oblivious algorithm also has a cost of O(N^2/B).
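The cache-oblivious transpose splits the matrix recursively rather than using an explicit tile size. An illustrative sketch (my own naming) that transposes an m-by-n block out of place by always halving the longer side; once a recursive block fits in cache its I/O is amortized, with no reference to M or B in the code:

```python
def co_transpose(A, Out, r, c, m, n):
    # Cache-obliviously write the transpose of the m-by-n block of A with
    # top-left corner (r, c) into Out, so that Out[j][i] = A[i][j].
    if m == 1 and n == 1:
        Out[c][r] = A[r][c]
    elif m >= n:
        h = m // 2                              # split the rows
        co_transpose(A, Out, r, c, h, n)
        co_transpose(A, Out, r + h, c, m - h, n)
    else:
        h = n // 2                              # split the columns
        co_transpose(A, Out, r, c, m, h)
        co_transpose(A, Out, r, c + h, m, n - h)
```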

  26. Cache-Oblivious Algorithms. Algorithms not parameterized by M or B: these algorithms are unaware of the parameters of the memory hierarchy. Analyze in the ideal-cache model: the same as the I/O model, except that optimal replacement is assumed. Use a specific cache sequence to analyze the I/O cost: an ideal cache will do no worse than this specific load/evict sequence. In practice, real caches based on the LRU policy perform similarly to the optimal cache.

  27. Problems of the I/O model and I/O algorithms. What about a multiple-level memory hierarchy? A cache-oblivious algorithm applies to all cache levels. The algorithm is machine-specific, requiring a rewrite on another machine? No, you don't need to rewrite. What about when M and B are not static? The algorithm adapts automatically. How about parallelism? Still use work-span (depth) analysis, often via divide-and-conquer.

  28. Last week - cache-oblivious algorithms

     Problem                 RAM algorithm   CO algorithm
     Sorting                 O(N log N)      O((N/B) log_{M/B}(N/B))
     Matrix transpose        O(N^2)          O(N^2/B)
     Matrix multiplication   O(N^3)          O(N^3/(B sqrt(M)))
     Search                  O(log_2 N)      O(log_B N)

  29. Matrix Multiplication. Consider the standard iterative matrix-multiplication algorithm, Z := X * Y, where X, Y, and Z are N-by-N matrices:

     for i = 1 to N do
       for j = 1 to N do
         for k = 1 to N do
           Z[i][j] += X[i][k] * Y[k][j]

     This is O(N^3) computation in the RAM model. What about I/O?

  30. How Are Matrices Stored? How data is arranged in memory affects I/O performance. Suppose X, Y, and Z are in row-major order:

     for i = 1 to N do
       for j = 1 to N do
         for k = 1 to N do
           Z[i][j] += X[i][k] * Y[k][j]

     If N > M, there is no locality across iterations for X and Y: Theta(N^3) I/Os. Reading a column of row-major Y is expensive: Theta(N) I/Os per column.

  31. How Are Matrices Stored? Suppose X and Z are in row-major order but Y is in column-major order. Then the inner loop scans a row of X and a column of Y, O(N/B) I/Os each, for Theta(N^3/B) I/Os in total. This layout is not too inconvenient to obtain: transposing Y has only O(N^2/B) cost. But we can do much better than Theta(N^3/B) I/Os, even if all matrices are row-major.

  32. Tiling. Z := X * Y:

     for i = 1 to N/s do
       for j = 1 to N/s do
         for k = 1 to N/s do
           for i' = (i-1)*s+1 to i*s do
             for j' = (j-1)*s+1 to j*s do
               for k' = (k-1)*s+1 to k*s do
                 Z[i'][j'] += X[i'][k'] * Y[k'][j']

     Plug in s = Theta(sqrt(M)): the number of subMMs is (N/s)^3 = O((N/sqrt(M))^3), and the I/O per subMM is O(s^2/B) = O(M/B), so the overall I/O cost is O(N^3/(B sqrt(M))).

  33. Recursive Matrix Multiplication. Compute 8 submatrix products recursively:

     Z11 := X11*Y11 + X12*Y21
     Z12 := X11*Y12 + X12*Y22
     Z21 := X21*Y11 + X22*Y21
     Z22 := X21*Y12 + X22*Y22

     When the three submatrices of X, Y, and Z fit into cache, the algorithm loads them all from slow memory, applies the multiply, and writes the Z submatrix back to main memory, at a cost of O(M/B).

  34. Recursive Matrix Multiplication. Compute 8 submatrix products recursively:

     Z11 := X11*Y11 + X12*Y21
     Z12 := X11*Y12 + X12*Y22
     Z21 := X21*Y11 + X22*Y21
     Z22 := X21*Y12 + X22*Y22

     Recurrence: Q(N) = 8 Q(N/2). Base case: Q(O(sqrt(M))) = O(M/B), once the submatrices fit in cache. Solving it gives Q(N) = O(N^3/(B sqrt(M)) + N^2/B).

  35. Recursive Matrix Multiplication. Solving the recurrence gives Q(N) = O(N^3/(B sqrt(M)) + N^2/B). With a careful implementation, the parallel depth can be O(log^2 N).
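The eight-way recursion above can be sketched directly. An illustrative Python sketch (my own naming; n restricted to powers of two for simplicity, and quadrants copied rather than viewed in place):

```python
def rec_matmul(X, Y):
    # Cache-oblivious matrix multiplication: eight recursive half-size
    # products, combined by submatrix addition. Recursion bottoms out at
    # 1x1; an ideal cache stops paying I/O once subproblems fit in cache.
    n = len(X)
    if n == 1:
        return [[X[0][0] * Y[0][0]]]
    h = n // 2
    def quad(M, qi, qj):
        return [row[qj * h:(qj + 1) * h] for row in M[qi * h:(qi + 1) * h]]
    def add(A, B):
        return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
    X11, X12, X21, X22 = quad(X, 0, 0), quad(X, 0, 1), quad(X, 1, 0), quad(X, 1, 1)
    Y11, Y12, Y21, Y22 = quad(Y, 0, 0), quad(Y, 0, 1), quad(Y, 1, 0), quad(Y, 1, 1)
    Z11 = add(rec_matmul(X11, Y11), rec_matmul(X12, Y21))
    Z12 = add(rec_matmul(X11, Y12), rec_matmul(X12, Y22))
    Z21 = add(rec_matmul(X21, Y11), rec_matmul(X22, Y21))
    Z22 = add(rec_matmul(X21, Y12), rec_matmul(X22, Y22))
    top = [r1 + r2 for r1, r2 in zip(Z11, Z12)]
    bot = [r1 + r2 for r1, r2 in zip(Z21, Z22)]
    return top + bot
```

In a parallel implementation the four Z-quadrants (and the two products within each) can be computed in parallel, which is where the O(log^2 N) depth comes from.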

  36. Algorithms Similar to Matrix Multiplication. Linear algebra: Gaussian elimination, LU decomposition, triangular system solver, QR decomposition. Dynamic programming: Floyd-Warshall for APSP, edit distance (longest common subsequence), 1D clustering, and many applications in molecular biology and geology (e.g., protein folding/editing, many structure problems).

  37. Searching: binary search is bad. Example: binary search for element A with block size B = 3 over the sorted keys A through O. The search hits a different block at every probe until reducing the keyspace to size O(B). Thus the total cost is log_2 N - O(log_2 B) = Theta(log_2(N/B)) for N >> B.

  38. Static cache-oblivious searching. Goal: organize N keys in memory to facilitate efficient searching (the van Emde Boas layout). 1. Build a balanced binary tree on the keys. 2. Lay out the tree recursively in memory, splitting the tree at half its height.

  39. Static layout example. (Figure: a 15-node balanced tree rooted at H. The recursion places the top half-height subtree H D L first, then the four bottom subtrees left to right, giving the memory order H D L B A C F E G J I K N M O.)
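The recursive layout can be generated mechanically. An illustrative sketch (my own naming; the odd-height split convention varies across presentations) that emits the van Emde Boas visit order for a complete binary tree in heap numbering; for height 4 it reproduces the 15-node ordering of the example above:

```python
def veb_order(root, height, out):
    # Append the nodes of a complete binary tree (heap numbering: the
    # children of v are 2v and 2v+1) in van Emde Boas order: first the
    # top half-height subtree, then each bottom subtree left to right.
    if height == 1:
        out.append(root)
        return
    top_h = height // 2              # split at half the height
    bot_h = height - top_h
    veb_order(root, top_h, out)      # lay out the top subtree recursively
    leaves = [root]                  # find the top subtree's leaves
    for _ in range(top_h - 1):
        leaves = [c for v in leaves for c in (2 * v, 2 * v + 1)]
    for v in leaves:                 # then each hanging bottom subtree
        veb_order(2 * v, bot_h, out)
        veb_order(2 * v + 1, bot_h, out)
```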

  40. Cache-oblivious searching: Analysis I. Consider the recursive subtrees of size sqrt(B) to B on a root-to-leaf search path. Each such subtree is contiguous in memory and fits in O(1) size-B blocks (at most 2, without alignment). Each subtree has height Theta(log_2 B), so there are Theta(log_B N) of them on the path.

  42. Cache-oblivious searching: Analysis II. Analyze using a recurrence: S(N) = 2 S(sqrt(N)). Base case: S(B) = 1. This solves to S(N) = O(log_B N).
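The recurrence above unrolls cleanly if we substitute away the square root; a worked derivation:

```latex
S(N) = 2\,S(\sqrt{N}), \qquad S(B) = 1.
\text{Let } m = \log_2 N \text{ and } S'(m) = S(2^m)\text{, so that}
S'(m) = 2\,S'(m/2), \qquad S'(\log_2 B) = 1.
\text{Halving } m \text{ from } \log_2 N \text{ down to } \log_2 B \text{ takes }
\log_2\!\frac{\log_2 N}{\log_2 B} \text{ levels, hence}
S(N) = 2^{\log_2(\log_2 N/\log_2 B)} = \frac{\log_2 N}{\log_2 B} = O(\log_B N).
```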

  43. Summary for cache-oblivious searching. Use a special (van Emde Boas) layout to reduce the memory footprint of a search from Theta(log_2 N) blocks to Theta(log_B N) blocks. A similar idea can be applied to more complicated scenarios, such as the dynamic setting (the B^epsilon-tree).

  44. Last week - cache-oblivious algorithms

     Problem                 RAM algorithm   CO algorithm
     Sorting                 O(N log N)      O((N/B) log_{M/B}(N/B))
     Matrix transpose        O(N^2)          O(N^2/B)
     Matrix multiplication   O(N^3)          O(N^3/(B sqrt(M)))
     Search                  O(log_2 N)      O(log_B N)

  45. Sample-sort outline. Analogous to multiway quicksort. 1. Split the input array into sqrt(N) contiguous subarrays of size sqrt(N); sort the subarrays recursively.

  46. Sample-sort outline. Analogous to multiway quicksort. 1. Split the input array into sqrt(N) contiguous subarrays of size sqrt(N); sort the subarrays recursively. (Each subarray is now sorted.)

  47. Sample-sort outline. 2. Choose sqrt(N) - 1 good pivots p_1 <= p_2 <= ... <= p_{sqrt(N)-1}. 3. Distribute the subarrays into buckets according to the pivots: bucket 1, bucket 2, ..., bucket sqrt(N).

  48. Sample-sort outline. 4. Recursively sort the buckets. 5. Copy the concatenated buckets back to the input array, which is now sorted.

  49. Sample-sort analysis sketch. Step 1 (implicitly) divides the array and sorts sqrt(N) size-sqrt(N) subproblems. Step 4 sorts sqrt(N) buckets of size about sqrt(N), with total size N. Claim: steps 2, 3, and 5 use O(N/B) I/Os. This gives the recurrence Q(N) = 2 sqrt(N) Q(sqrt(N)) + O(N/B), which solves to Q(N) = O((N/B) log_{M/B}(N/B)).

  50. Sample-sort outline. 2. Choose sqrt(N) - 1 good pivots p_1 <= p_2 <= ... <= p_{sqrt(N)-1}. This can be achieved by picking Theta(sqrt(N) log N) random samples, sorting them, and taking every Theta(log N)-th element. This step requires O(N/B) operations.
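The outline (sorted oversample for pivots, bucket distribution, recursive bucket sorts) can be sketched as a sequential prototype. This is an illustrative sketch under my own choices of cutoff and oversampling factor, not the lecture's exact algorithm:

```python
import bisect
import math
import random

def sample_sort(a):
    # Sample sort sketch: pick about sqrt(n) pivots from an oversampled,
    # sorted random sample, distribute into buckets, recurse on buckets.
    n = len(a)
    if n <= 16:                          # small-case cutoff (my choice)
        return sorted(a)
    k = math.isqrt(n)                    # target number of buckets
    over = int(math.log2(n)) + 1         # Theta(log n) oversampling factor
    sample = sorted(random.choices(a, k=k * over))
    pivots = sample[over::over][:k - 1]  # every `over`-th sample element
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in a:
        buckets[bisect.bisect_right(pivots, x)].append(x)
    out = []
    for b in buckets:
        # guard against a degenerate split (e.g. many equal keys)
        out.extend(sample_sort(b) if len(b) < n else sorted(b))
    return out
```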
