Understanding Synchronization and Atomicity in Programming

"Explore the concepts of synchronization and atomicity in programming, covering race conditions, thread coordination, atomic operations, and more. Learn how to ensure data consistency and thread safety in parallel computing environments."




Presentation Transcript


  1. SYNCHRONIZATION AND ATOMICITY

  2. Coordination is required: Race conditions

     x = x + 1;

     time   Processor 1                             Processor 2                             x in memory
     t1     Load x from memory into register (17)                                           17
     t2                                             Load x from memory into register (17)   17
     t3     Add 1 to register (18)                                                          17
     t4     Store register into memory x                                                    18
     t5                                             Add 1 to register (18)                  18
     t6                                             Store register into memory x            18

     Incorrect: should have gotten 19.
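The lost-update interleaving above is easy to reproduce on a GPU. A minimal sketch (the kernel name and launch shape are illustrative, not from the slides):

```cuda
#include <cstdio>

__device__ int x = 0;

// Every thread does an unsynchronized read-modify-write on x. The loads
// and stores from different threads can interleave exactly as in the
// timeline above, so most of the increments are lost.
__global__ void racyIncrement() {
    x = x + 1;   // load x, add 1, store x -- three separate steps
}

int main() {
    racyIncrement<<<256, 256>>>();
    cudaDeviceSynchronize();

    int result;
    cudaMemcpyFromSymbol(&result, x, sizeof(int));
    // Expect far less than 256 * 256 = 65536: updates were lost.
    printf("x = %d\n", result);
    return 0;
}
```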

  3. __syncthreads() and variants
     Barrier for all threads in a block.
     Warning: if any threads in the block don't execute the call, the others will wait forever and the kernel will never finish.
     Synchronizing across blocks generally requires exiting the kernel and then starting another kernel.
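The standard use of __syncthreads() is staging data through shared memory. A sketch of a block-level tree reduction (names and the block size of 256 are illustrative), which also shows why the barrier must stay outside divergent code:

```cuda
// Tree reduction within one block. Each __syncthreads() guarantees that
// every thread has finished its store before any thread reads it.
// Note the barrier is OUTSIDE the divergent if: putting it inside would
// hang, because threads with tid >= stride would never reach the call.
__global__ void blockSum(const int* in, int* out) {
    __shared__ int buf[256];              // assumes blockDim.x == 256
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                      // all loads into shared memory done

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                  // reached by ALL threads each pass
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];         // one partial sum per block
}
```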

  4. __threadfence() and variants THIS ISN'T FINISHED

  5. Atomic operations
     Look like a C function call.
     Performs some operation in an indivisible manner, as if no other thread ran at the same time.
     From Greek atomos = "no cuts" = indivisible.
     It's not implemented that way: it would be too slow.
     Simple operations only, accessing a single address: read - operate - write.
     Hardware prevents another thread from accessing the address until the read-operate-write is complete.

  6. Incrementing with atomicAdd()

     int atomicAdd(int* address, int val);

     atomicAdd(&x, 1);

     time   Processor 1                             Processor 2                             x in memory
     t1     Load x from memory into register (17)                                           17
     t2     Add 1 to register (18)                                                          17
     t3     Store register into memory x                                                    18
     t4                                             Load x from memory into register (18)   18
     t5                                             Add 1 to register (19)                  18
     t6                                             Store register into memory x            19

     Correct: the two read-operate-writes cannot interleave, so x ends at 19.
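With atomicAdd() the racy increment from slide 2 becomes a single indivisible read-operate-write. A sketch (kernel name and launch shape are illustrative):

```cuda
__device__ int x = 0;

__global__ void safeIncrement() {
    // Hardware serializes the load-add-store on &x, so no update is lost.
    // The return value (ignored here) is the previous value of x.
    atomicAdd(&x, 1);
}

// After safeIncrement<<<256, 256>>>() completes, x == 256 * 256 exactly,
// unlike the unsynchronized version.
```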

  7. CUDA atomic operations
     Atomic across all threads: not just threads in one block or one kernel.
     Only available in device functions (kernels).
     Arithmetic: add, sub, max, min, increment, decrement.
     Exchange two values.
     Bitwise and, or, xor (xor gives you not - why?).
     int atomicCAS(int* address, int compare, int val); - compare and swap:
        old = *address; *address = (old == compare ? val : old); returns old.
     Lockless algorithms for data structures: guru-level.
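atomicCAS() is the building block for atomic operations CUDA does not provide directly. A sketch of the standard retry loop, here implementing an atomic integer multiply (a hypothetical helper, not a CUDA built-in):

```cuda
// Atomically performs *address *= val using a compare-and-swap loop.
// If another thread changed *address between our read and our CAS,
// atomicCAS returns a value different from `assumed` and we retry
// with the freshly observed value.
__device__ int atomicMul(int* address, int val) {
    int old = *address, assumed;
    do {
        assumed = old;
        old = atomicCAS(address, assumed, assumed * val);
    } while (assumed != old);
    return old;   // previous value, matching the other atomics
}
```

The same loop shape works for any read-operate-write on a 32-bit or 64-bit word, which is why atomicCAS is the basis of lockless algorithms.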

  8. Atomic operations performance

  9. Atomic operations performance

  10. Some ordering of memory reads/writes within a block
      void __threadfence_block(); ensures that:
      • All writes to shared and global memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to shared and global memory made by the calling thread after the call;
      • All reads from shared and global memory made by the calling thread before the call to __threadfence_block() are performed before all reads from shared and global memory made by the calling thread after the call.
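A sketch of what this ordering buys you (variable names are illustrative): if one thread publishes a value and then sets a flag with a fence in between, any thread in the same block that observes the flag set is also guaranteed to observe the value.

```cuda
__global__ void publishInBlock() {
    __shared__ volatile int result;
    __shared__ volatile int flag;    // assume initialized to 0 elsewhere

    if (threadIdx.x == 0) {
        result = 42;                 // write 1: the data
        __threadfence_block();       // write 1 ordered before write 2
        flag = 1;                    // write 2: the signal
    }
    // Any thread in this block that later reads flag == 1 is guaranteed
    // by the fence to read result == 42 as well. Without the fence, the
    // two writes could become visible in either order.
}
```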

  11. Some ordering of memory reads/writes within all blocks of a kernel
      void __threadfence(); acts as __threadfence_block() for all threads in the block of the calling thread, and also ensures that all writes to global memory made by the calling thread before the call to __threadfence() are observed by all threads in the device as occurring before all writes to global memory made by the calling thread after the call.

  12. A (maybe slow) way to coordinate between blocks
      Problem: we want the last block to finish execution to perform some task.

      In the .cu file:

          __device__ unsigned int count = 0;   // Does this need to be reinitialized to 0 for every launch?

      In the kernel (assumes a 1D grid of blocks):

          __shared__ bool isLastBlockDone;

          // All threads do some work
          ...
          // Thread 0 signals that its block is finished
          if (threadIdx.x == 0) {
              unsigned int prevCount = atomicInc(&count, gridDim.x);
              // prevCount is the *previous* value of count, so this block is the
              // last to finish if and only if prevCount is one less than the
              // number of blocks
              isLastBlockDone = (prevCount == (gridDim.x - 1));
          }
          __syncthreads();   // Make sure all threads read the correct value of isLastBlockDone
          if (isLastBlockDone) {
              // ... perform the task
          }

  13. Discussion
      How could atomicAdd() be used in an arbitrary-length reduction-by-+ kernel?
      How could __threadfence_block() be used?
      Would either of these be an improvement? How could we find out?
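One possible shape for the first question (a sketch under my own assumptions, not the course's answer): each block reduces in shared memory as on slide 3, then a single atomicAdd() folds the block's partial sum into a global total, so one launch handles any input length.

```cuda
__global__ void reduceAdd(const int* in, int n, int* total) {
    __shared__ int buf[256];          // assumes blockDim.x == 256
    int tid = threadIdx.x;

    // Grid-stride loop handles arbitrary n with a fixed grid size.
    int sum = 0;
    for (int i = blockIdx.x * blockDim.x + tid; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];
    buf[tid] = sum;
    __syncthreads();

    // Block-level tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }

    // One atomic per block instead of one per element, so contention
    // on *total stays low. *total must start at 0 before the launch.
    if (tid == 0) atomicAdd(total, buf[0]);
}
```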
