Memory Coherence and Communication in Shared Virtual Memory Systems

Explore the concepts of memory coherence, interprocess communication, and shared virtual memory systems from the paper by Kai Li and Paul Hudak, as presented by Kuangyuan Chen. Learn about the challenges, algorithms, experiments, and conclusions related to memory coherence, as well as the design choices and coherence problems in shared virtual memory systems.

  • Memory Coherence
  • Virtual Memory Systems
  • Shared Memory
  • Interprocess Communication
  • Coherence Algorithms


Presentation Transcript


  1. Memory Coherence in Shared Virtual Memory Systems. By Kai Li and Paul Hudak; presented by Kuangyuan Chen.

  2. Outline
     • Shared Virtual Memory
     • Coherence Problem
     • Memory Coherence Algorithms
     • Experiments
     • Conclusions

  3. Interprocess Communication
     • Remote procedure call? Awkward for complex data structures and pointers.
     • Shared virtual memory: provides a single address space for all processors, so programmers can use the distributed memories as traditional shared memory.
     • Both can be implemented using message-passing primitives (send, recv).

  4. Shared Virtual Memory
     • Map a single address space onto multiple physical memories.
     • Page data between processors (as well as between disk and physical memory within one processor).
     • Replicate data whenever possible; view the physical memories as caches of the virtual storage.
     • Performance: unshared data and shared read-only data are fine; writes to shared data are fine if locality is good.
     • Problem: memory coherence.

  5. Coherence Problem
     • Fundamental reason: multiple copies of the same data.
     • Example: suppose P1 on CPU1 and P2 on CPU2 map to the same virtual page and each holds a copy of it (initially 1). If the two copies are updated to different values (P1 to 2, P2 to 3), which one should a reader see? If only P1 is updated to 2, a read of the copy P2 still returns the stale value 1: wrong data!
     • Coherence: the value returned by a read is always the same as the value written by the latest write.
     • Ideas are borrowed from directory-based cache coherence protocols; usually only one writer is allowed at a time.
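
A minimal sketch of the problem described above (the variable name and values are made up, not from the slides): two processors each cache a copy of the same virtual page, and without any coherence protocol a read can return a stale value.

```python
# Minimal illustration (hypothetical, not from the paper): two CPUs cache
# copies of the same virtual page; without coherence, reads see stale data.

virtual_page = {"x": 1}                  # the "true" page contents
cpu1_copy = dict(virtual_page)           # CPU1 maps and caches the page
cpu2_copy = dict(virtual_page)           # CPU2 maps and caches the page

cpu1_copy["x"] = 2                       # CPU1 writes its local copy only

print("CPU1 reads x =", cpu1_copy["x"])  # 2
print("CPU2 reads x =", cpu2_copy["x"])  # 1  <- stale value: incoherent

# Coherence requires that the value returned by a read is the value
# written by the latest write, which is violated here.
```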

  6. Designing Shared Virtual Memory: Design Choices
     • Page size: the granularity of network communication. Too small: the overhead of a single message dominates. Too large: contention for a page (false sharing). Application dependent!
     • Coherence algorithm: how to synchronize physical pages across the system? Invalidation or write-broadcast?
     • How to maintain the ownership of a virtual page? Ownership defines who responds to a request; the page table enforces access by setting the read/write bits. Ownership should be unique but can change dynamically. A manager tracks the owners of all pages: centralized or distributed? Fixed or dynamic?
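
A small sketch of the false-sharing point above. The page size and addresses are invented for illustration: two unrelated variables that happen to land on the same page force their writers to fight over that page even though they never share data.

```python
# Hypothetical sketch of the page-size trade-off: with page-granularity
# sharing, two unrelated variables that live on the same page make their
# writers contend for that page (false sharing).

PAGE_SIZE = 4096  # bytes; an assumed page size for illustration

def page_of(addr):
    """Virtual page number that a byte address falls on."""
    return addr // PAGE_SIZE

addr_of_a = 0x1000   # variable updated only by CPU1 (hypothetical layout)
addr_of_b = 0x1800   # variable updated only by CPU2 (same page as a)

if page_of(addr_of_a) == page_of(addr_of_b):
    print("a and b share page", page_of(addr_of_a),
          "-> every write ping-pongs the whole page between CPUs")
```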

  7. Memory Coherence Algorithms: Centralized Manager
     • The manager keeps, for every page, its owner and its copyset; each CPU keeps a page table (Ptable) with the access rights (Read, Write, or NIL), protected by locks.
     • Write-fault handling (the slide's diagram): 1: the faulting CPU sends a write-fault request to the manager; 2: CPUs in the copyset invalidate their local copies; 3: the manager forwards the request to the owner; 4: the owner sends the page to the requester; 5: the requester sends a confirmation to the manager.
     • Why do we need the confirmation message? The manager synchronizes ownership, so it must know that a request has completed before processing the next one.
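
A simplified, single-threaded sketch of the write-fault path just described. This is an illustration of the idea, not the paper's pseudocode: the real algorithm runs on separate processors with locks and messages, and the data-structure names here are assumptions.

```python
# Centralized manager, write-fault path (sequential simulation).

manager = {0: {"owner": 0, "copyset": {1}}}      # per-page: owner and copyset
ptable = [{0: "write"}, {0: "read"}, {0: None}]  # per-CPU access rights

def write_fault(page, requester):
    info = manager[page]              # 1: write-fault request goes to the manager
    for cpu in info["copyset"]:       # 2: holders of read copies invalidate them
        ptable[cpu][page] = None
    info["copyset"] = set()
    old_owner = info["owner"]         # 3: manager forwards the request to the owner
    ptable[old_owner][page] = None    # 4: owner gives up the page ...
    ptable[requester][page] = "write" #    ... and the requester receives it
    info["owner"] = requester         # 5: confirmation tells the (real) manager the
                                      #    transfer finished before it serves the
                                      #    next request; here everything is sequential

write_fault(0, requester=2)
print(manager[0])   # {'owner': 2, 'copyset': set()}
print(ptable)       # CPU2 now writes the page; CPU0 and CPU1 lost their copies
```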

  8. Memory Coherence Algorithms: Improved Centralized Manager
     • The owner itself synchronizes ownership, and the copyset is collocated with the owner: each CPU's Ptable entry now holds the copyset and access rights, while the manager keeps only the owner of each page.
     • Write-fault handling: 1: the faulting CPU sends a write-fault request to the manager; 2: the manager forwards the request to the owner; 3: the owner sends the page and its copyset to the requester; 4: the requester invalidates the copies in the copyset.
     • Remaining problem: contention, since faults on all pages still go to a single manager.

  9. Memory Coherence Algorithms: Fixed Distributed Manager
     • Distribute the manager role by interleaving on the page number, so each CPU manages a fixed subset of the pages (the slide shows pages P0 through P5 assigned to their managers).
     • Can still suffer from contention, since the manager assignment is static.
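
A tiny sketch of the interleaved page-to-manager mapping described above; the modulo mapping and CPU count are assumptions used only for illustration.

```python
# Fixed distributed manager: the manager role is interleaved by page number.

N_CPUS = 4

def manager_of(page):
    """Statically assigned manager for a page (an assumed interleaving)."""
    return page % N_CPUS

for p in range(6):
    print(f"page {p} is managed by CPU{manager_of(p)}")
# Because the assignment is static, pages that fault very often still
# funnel all their requests through one fixed manager (the contention
# noted on the slide).
```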

  10. Memory Coherence Algorithms: Dynamic Distributed Manager
     • Merge the manager into each CPU's Ptable: every entry keeps a probOwner field, a hint that probably knows the owner, along with the copyset and access rights.
     • Read-fault handling: 1: the faulting CPU sends a read-fault request to its probOwner; 2: if that CPU is not the owner, it forwards the request along its own probOwner hint; the true owner then sends the page to the requester.
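
An illustrative sketch of the probOwner chase described above (the CPU names and hint values are made up, and the copyset bookkeeping is omitted): a fault request is forwarded along the hints until the true owner, which points at itself, is reached.

```python
# Dynamic distributed manager: follow probOwner hints to find the owner.

# probOwner hint each CPU holds for one particular page (hypothetical state;
# the owner is the CPU whose hint points at itself).
prob_owner = {"CPU1": "CPU2", "CPU2": "CPU3", "CPU3": "CPU3"}

def read_fault(requester):
    """Follow probOwner hints to the true owner; return the owner found."""
    node, forwards = prob_owner[requester], 0
    while prob_owner[node] != node:       # not the owner yet: forward the request
        node = prob_owner[node]
        forwards += 1
    # The owner sends the page and records the requester in its copyset
    # (omitted); the requester can also update its hint to shorten future chains.
    prob_owner[requester] = node
    print(f"{requester} reached owner {node} after {forwards} forward(s)")
    return node

read_fault("CPU1")
```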

  11. Memory Coherence Algorithms
     • Can a request always find the true owner?
     • Theorem 1: A page fault on any processor reaches the true owner of the page using at most N-1 forwarding requests.

  12. Memory Coherence Algorithms: A Glance at the Proof
     • probOwner graph Gp: each processor points to its probOwner.
     • Gp is directed and acyclic (a rooted tree) after initialization: all processors point to the true owner.
     • The graph modified by any request is still a rooted tree, and the maximum path length in a tree of N nodes is N-1.
     • The slide's example: 1: P3 sends a request; 2: P1 changes its probOwner and forwards the request; 3: P0 changes its probOwner and sends the page to P3; 4: the probOwner graph after the request is again a rooted tree.
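
A toy sketch of this argument, under the assumption (consistent with the slide's example) that every processor on the request path re-points its probOwner at the requester, which becomes the new root. Starting from a rooted tree, the number of hops is bounded by N-1 and the updated graph is again a rooted tree.

```python
# probOwner graph exercise: requests follow hints; the touched path is re-rooted.

def handle_fault(prob_owner, requester):
    """Forward along probOwner hints to the owner, then re-root at the requester."""
    path, node = [requester], prob_owner[requester]
    while prob_owner[node] != node:      # each hop is one request hop
        path.append(node)
        node = prob_owner[node]
    path.append(node)                    # node is now the true owner
    for p in path:                       # every processor touched now points
        prob_owner[p] = requester        # at the requester, the new owner/root
    return len(path) - 1                 # hops taken (bounded by N-1)

owners = [0, 0, 0, 0, 0]                 # 5 CPUs, all pointing at the owner, CPU0
print(handle_fault(owners, 3), owners)   # 1 hop; graph re-rooted at CPU3
print(handle_fault(owners, 4), owners)   # still bounded by N-1 = 4 hops
```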

  13. Experiments
     • Speedup: the time on a single processor divided by the time on the SVM system.
     • Four parallel computing programs: 3D PDE, parallel sort, matrix multiplication, dot product.
     • Comparison of the coherence algorithms.

  14. 3D-PDE
     • Solving a linear equation system, with the data partitioned between CPU1 and CPU2 (diagram on the original slide).

  15. 3D-PDE
     • Speedup plot (solid: experiment; dashed: ideal). Superlinear speedup?

  16. Superlinear Speedup
     • Disk-paging plot (solid: single processor; dashed: the processor holding the initial data in the two-processor system; dotted: the processor without the initial data), alongside the speedup plot (solid: experiment; dashed: ideal).
     • The number of disk pages remains high in the single-processor case but quickly decreases once execution starts in the two-processor case; reducing the data size results in sublinear speedup.
     • Conclusion: 1. In the single-processor case the system suffers from thrashing. 2. The working sets do not fit into one processor's memory, while they fit into two processors' memory.

  17. 3D-PDE
     • More physical memory reduces thrashing, and the program exhibits a high degree of locality.
     • The slide shows the system partitioned by rows across the processors: (A1; A2) x = (b1; b2).
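
Purely as an illustration of the row-wise partitioning and locality mentioned above, here is a toy 1-D Jacobi stand-in (the paper's program solves a 3-D PDE; the grid size, split point, and iteration count here are invented).

```python
# Toy 1-D Jacobi sketch: the grid is split between two "processors"; each sweep
# touches only local points plus one neighbouring boundary value, so almost all
# references stay local (good locality for SVM).

n = 10
grid = [0.0] * n
grid[0], grid[-1] = 1.0, 1.0            # fixed boundary conditions
half = n // 2                           # CPU1 owns [0, half), CPU2 owns [half, n)

for _ in range(50):                     # Jacobi relaxation sweeps
    new = grid[:]
    for i in range(1, n - 1):
        new[i] = 0.5 * (grid[i - 1] + grid[i + 1])
    grid = new
    # Only grid[half - 1] and grid[half] cross the CPU1/CPU2 boundary,
    # i.e. roughly one shared page's worth of data per sweep.

print([round(v, 3) for v in grid])
```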

  18. Parallel Sort: Block Odd-Even Merge-Split Sort
     • The data is divided into blocks (Block 0 through Block 7 on the slide) distributed over the CPUs.
     • Execution alternates between even and odd phases over time (T0-even, T0-odd, T1-even, ...): in each phase, adjacent pairs of blocks are merged and split by CPU1 through CPU4, with the pairing shifted by one block between the even and odd phases.
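
An illustrative sketch of block odd-even merge-split sort (not the paper's implementation; the block contents are invented): each block is kept sorted, and in alternating phases adjacent block pairs are merged and split back into a low half and a high half. In the real program each block lives on one processor, and these merge-split steps are what move pages back and forth.

```python
def merge_split(lo_block, hi_block):
    """Merge two sorted blocks and split the result into low/high halves."""
    merged = sorted(lo_block + hi_block)
    k = len(lo_block)
    return merged[:k], merged[k:]

def block_odd_even_sort(blocks):
    nb = len(blocks)
    blocks = [sorted(b) for b in blocks]          # each CPU sorts its block locally
    for phase in range(nb):                       # nb alternating phases suffice
        start = 0 if phase % 2 == 0 else 1        # even phase: pairs (0,1),(2,3),...
        for i in range(start, nb - 1, 2):         # odd phase:  pairs (1,2),(3,4),...
            blocks[i], blocks[i + 1] = merge_split(blocks[i], blocks[i + 1])
    return blocks

blocks = [[9, 3], [7, 1], [8, 0], [5, 2], [6, 4]]
print(block_odd_even_sort(blocks))   # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```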

  19. Parallel Sort
     • Speedup plot (solid: experiment; dashed: theoretical). The theoretical speedup is sublinear:
       speedup = (n log n) / ((n/N) log(n/N) + (n/N) * N)
     • Alternating the odd-even blocks results in a large amount of data transfer.
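
A quick numeric check of the speedup expression as reconstructed above (the formula itself is a reconstruction of the garbled slide, and the element count is arbitrary): the merge phases cost about (n/N)·N = n no matter how many processors are used, so the speedup flattens out.

```python
import math

def sort_speedup(n, N):
    """Reconstructed theoretical speedup of block odd-even merge-split sort."""
    sequential = n * math.log2(n)
    parallel = (n / N) * math.log2(n / N) + (n / N) * N
    return sequential / parallel

n = 200_000
for N in (2, 4, 8, 16, 32):
    print(N, round(sort_speedup(n, N), 2))   # grows much more slowly than N
```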

  20. Matrix Multiplication
     • Block decomposition across CPU1 through CPU4: (A1; A2)(B1 B2) = (A1B1 A1B2; A2B1 A2B2), with each CPU computing one block of the result.

  21. Matrix Multiplication
     • Speedup plot (solid: experiment; dashed: ideal).
     • Most memory references are read-only, and the program exhibits a high degree of locality.
     • Same decomposition as before: (A1; A2)(B1 B2) = (A1B1 A1B2; A2B1 A2B2).
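
A small sketch of the block decomposition on these slides (the matrices are made up): A is split into row blocks A1, A2 and B into column blocks B1, B2, and each of the four "CPUs" computes one block Ai·Bj independently, reading A and B but never writing them, which is why the sharing is read-only.

```python
def matmul(A, B):
    """Plain dense matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

A = [[1, 2], [3, 4], [5, 6], [7, 8]]           # 4x2
B = [[1, 0, 2, 1], [0, 1, 1, 2]]               # 2x4

A1, A2 = A[:2], A[2:]                          # row blocks of A
B1 = [row[:2] for row in B]                    # column blocks of B
B2 = [row[2:] for row in B]

blocks = {("A1", "B1"): matmul(A1, B1),        # computed by "CPU1"
          ("A1", "B2"): matmul(A1, B2),        # "CPU2"
          ("A2", "B1"): matmul(A2, B1),        # "CPU3"
          ("A2", "B2"): matmul(A2, B2)}        # "CPU4"

# Stitch the blocks back together and check against the direct product.
top = [r1 + r2 for r1, r2 in zip(blocks[("A1", "B1")], blocks[("A1", "B2")])]
bot = [r1 + r2 for r1, r2 in zip(blocks[("A2", "B1")], blocks[("A2", "B2")])]
assert top + bot == matmul(A, B)
print(top + bot)
```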

  22. Dot Product
     • The slide's diagram shows the input partitioned across CPU1 through CPU4, with the partial results collected back on CPU1.

  23. Dot Product
     • The dot product: sum over i = 1..n of x_i * y_i.
     • Speedup plot (dotted: input data located on one processor; solid: input data randomly distributed; dashed: ideal).
     • The data are only read once (no temporal locality), and the communication-to-computation ratio is large.
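
An illustration (not the paper's program; the vectors and processor count are invented) of why the dot product fares poorly under SVM: each element of x and y is read exactly once, so every page moved between processors contributes only a handful of multiply-adds.

```python
x = list(range(1, 17))
y = list(range(16, 0, -1))
N = 4                                        # number of "CPUs"
chunk = len(x) // N

# Each CPU computes a partial sum over its slice of the vectors ...
partials = [sum(xi * yi for xi, yi in zip(x[c * chunk:(c + 1) * chunk],
                                          y[c * chunk:(c + 1) * chunk]))
            for c in range(N)]

# ... and the partial sums are combined (on CPU1 in the slide's diagram).
print(sum(partials), sum(xi * yi for xi, yi in zip(x, y)))  # both the same
```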

  24. Comparing Coherence Algorithms
     • Compares the number of forwarding requests in 3D PDE for the three algorithms.
     • Fixed distributed and improved centralized are similar: both need a forwarding request from the manager to the owner.
     • Dynamic distributed is much better: its probOwner hints usually point to the correct owner.

  25. Conclusions
     • Which algorithm is the best?
       Centralized. Benefits: simple and easy to implement. Drawbacks: always needs a forwarding request; the centralized manager is the bottleneck.
       Fixed Distributed. Benefits: less contention on the distributed managers. Drawbacks: always needs a forwarding request; static distribution.
       Dynamic Distributed. Benefits: the manager role is dynamically distributed; the best-case number of requests is one. Drawbacks: the number of requests is nondeterministic.
     • What kind of programs benefit from SVM? Those with good locality of reference, read-only references, and a low communication-to-computation ratio.

  26. Q&A?
