 
										Understanding POSIX Threads (Pthreads) and Shared Memory Models in C++
Learn about POSIX threads (Pthreads) for shared address space programming, creating, joining, and executing threads using C++ with examples and explanations. Dive into basic locks and a practical "Hello World" program with pthreads in C++.
Download Presentation
 
								Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
- Other Shared Memory Models Posix Threads C++11 Atomics Cilk Tasks 1 
- POSIX Threads (Pthreads) for Shared Address Space Programming 
- Pthreads Posix threads package Available on almost all machines (portable standard) Sort of like doing parallel (not parallel for ) in OpenMP explicitly Basic calls: pthread_create: creates a thread, to execute a given function Pthread_join barrier, lock, mutex Thread private variables Many online resources: E.g., https://computing.llnl.gov/tutorials/pthreads/ L.V.Kale 3 
- Pthreads Create and Join Spawn an attached thread pthread_create (&thread1, NULL, foo, &arg) . . . pthread_join(thread1, status) Detached threads Join is not needed The OS destroys thread resources when they terminate A parameter in the create call indicates a detached thread Thread execution void foo(&arg) { // Thread code return(*status); } L.V.Kale 4 
- Executing a Thread Main Program . . . . Thread1 stack Thread1 void * func (void *arg) { pthread_create(&thread1, NULL, func, &arg); . . . . . . . . return (status); Main stack } pthread_join(thread1, status); . . . . L.V.Kale 5 
- Basic Locks Declare a lock: pthread_mutex_t mutex; Initialize a mutex pthread_mutex_init(&mutex, NULL); Enter and release pthread_mutex_lock(&mutex); and pthread_mutex_unlock(&mutex); Try lock without blocking: pthread_mutex_trylock(&mutex); Returns 0 if successful (i.e. lock is acquired) Release resources pthread_mutex_destroy(mutex); // Use defaults L.V.Kale 6 
- Hello World: Pthreads #include <stdlib.h> int main(int argc,char **argv) { long threads = strtol(argv[1], NULL, 10); pthread_t *threadHandles = malloc(threads* sizeof(pthread_t)); long *ids =(long* )malloc(sizeof(long) * threads); for (long t=0; t<threads; t++) { ids[t]=t; pthread_create(&threadHandles[t], NULL, Hello, (void *)&(ids[t])); } printf( Hello from the main thread\n ) ; for(long t=0; t<threads; t++) pthread_join(threadHandles[t], NULL); free(threadHandles); free(ids); } #include <stdlib.h> #include <stdio.h> #include <pthread.h> long *id = (long* )(myRank); printf( Hello from thread %ld\n , *id); return NULL; } #include <stdio.h> void* Hello(void* myRank) { #include <pthread.h> void* Hello(void* myRank) { long *id = (long* )(myRank); printf( Hello from thread %ld\n , *id); return NULL; } int main(int argc,char **argv) { long threads = strtol(argv[1], NULL, 10); pthread_t *threadHandles = malloc(threads* sizeof(pthread_t)); long *ids =(long* )malloc(sizeof(long) * threads); for (long t=0; t<threads; t++) { ids[t]=t; pthread_create(&threadHandles[t], NULL, Hello, (void *)&(ids[t])); } printf( Hello from the main thread\n ) ; for(long t=0; t<threads; t++) pthread_join(threadHandles[t], NULL); free(threadHandles); free(ids); } L.V.Kale 8 
- Threads and Resources Suppose you are running on a machine with K cores Each core may have 2 hardware threads (SMT) This is often called hyperthreading on SMT (simultaneous multi-threading) How many pthreads can you create? Unlimited (well the system may run out of resources like memory) Can be smaller or larger than K In performance oriented programs, its rarely more than 2K (assuming 2-way SMT) We want to prevent OS from swapping out our threads Which cores does each thread run on? By default: any (i.e., OS suspends each running thread every few ms, and runs another thread) L.V.Kale 9 
- Affinity Which cores does each thread run on? By default: any (i.e., OS suspends each running thread every few ms, and runs another thread) Even if you have fewer threads than the hardware threads But that s bad for cache locality Caches will be polluted by the work by other threads ... you will do a cold start almost always when you get scheduled every few ms Pthreads provide a way for binding threads to hardware resources for this purpose L.V.Kale 10 
- Pthread Affinity Set-affinity (or pinning) assigns a thread to a set of hardware threads Can use topological info to pin to core, sockets, NUMA domains, etc. A library that provides such information is hwloc Example pattern of usage ... cpu_set_t cpuset; CPU_ZERO(&cpuset); CPU_SET(PEnum, &cpuset); // can be called multiple times pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset) ... L.V.Kale 11 
- OpenMP vs. Pthreads OpenMP is great for parallel loops And for many simple situations with just #pragma omp parallel as well But when there is complicated synchronization, and performance is important, pthreads are (currently) better However, pthreads are not available on all machines/OS s Especially Windows L.V.Kale 12 
- Performance Oriented Programming in Pthreads Pthreads as used in OS programming don t need to be as performance oriented as what we need in HPC E.g., synchronizing every few microseconds I.e., exchanging data or waiting for signals Improving performance: Always use affinity Decide the number of pthreads to avoid any over-subscription and use SMT only if memory bandwidth (and floating point intensity) permit Minimize barriers, using point-to-point synchronization as much as possible (say, between producer and a consumer, as in Gauss-Seidel) Reduce cross-core communication (it s much better to use the data produced on one core on the same core if/when possible) Locks cause serialization of computation across threads L.V.Kale 13 
- C++11 Atomics Wait-free Synchronization and Queues 
- Recall.. Why the following doesn t work Initially: x, Flag, are both 0 Thread 0: Thread 1: x = 25; while (Flag == 0) ; Flag = 1; Print x; What should get printed? L.V.Kale 15 
- Sequential Consistency This is a desired property of parallel programming systems The effect of executing a program consisting of k threads should be the same as some arbitrary interleaving of statements executed by each thread, executed sequentially Modern processors do not satisfy sequential consistency! L.V.Kale 16 
- PE0 PE1 PEp-1 . . . , , Arbitrator Memory L.V.Kale 17 
- Support for memory consistency models OpenMP provided a flush primitive for dealing with this issue Ensures variables are written out to memory and no reordering of instructions happen across the flush call With Pthreads, in the past, you d need to use processor-specific memory-fence operations On Intel On PowerPC Load-linked-store-conditional, etc. C++ recently standardized this C++-11-atomics in C++-11 Standard Supports sequential consistency as well as specific relaxed consistency model L.V.Kale 18 
- C++-11 atomics http://en.cppreference.com/w/cpp/atomic/atomic Basic: Declare some (scalar) variables as atomic, This ensures accesses to those variables, among themselves, are sequentially consistent If one thread writes to an atomic object while another thread reads from it, the behavior is well-defined (memory model defines it) #include <atomic> Declarations std::atomic<T> atm_var std::atomic<T*> atm_ptr std::atomic <T>* atm_array L.V.Kale 19 
- Atomic: C++ 11 atomic atomic class template and specializations for bool, int, and pointer type atomic_store atomically replaces the value of the atomic object with a non-atomic argument atomic_load atomically obtains the value stored in an atomic object atomic_fetch_add adds a non-atomic value to an atomic object and obtains the previous value of the atomic atomic_compare_exchange_strong atomically compares the value of the atomic object with non-atomic argument and performs atomic exchange if equal or atomic load if not Source: https://en.cppreference.com/w/cpp/atomic L.V.Kale 20 
- atomic_compare_exchange_strong(a,b,c) b a c ?x = 7 5 5 0 b a c ? = 0 7 7 0 L.V.Kale 21 
- C++11 Atomics Avoiding Locks, serialization and Queues with Atomics 
- Locks, Serialization, and Wait-Free Synchronization Locks are an established way of enforcing mutual exclusion It also enforces aspects of sequential consistency: Memory operations are not moved across lock or unlock calls by the compiler Hardware is made to ensure all writes are completed at lock() or unlock() But locks are expensive, because they cause serialization L.V.Kale 24 
- Locks, Critical sections and serialization Suppose all threads are doing the following The work in dowork() is tw The time in critical is tc The serialization cost becomes a problem as the number of threads increase, but can be small up to #threads < tw/tc for I = 0, N dowork(); lock(x); critical .. ; unlock(x) L.V.Kale 25 
- tc tw tc tw tw tc tw tc tw tc tw tw tw tc tw tc tc tw tc tw tw tw tc tc tw tw tc tw tc tc tw tw tw tc tc tw tw tc tw tw tc tw tw tc tw tw tc tw tw tw tw tc L.V.Kale 26 
- Locks, Serialization, and Wait-Free Synchronization Locks are an established way of enforcing mutual exclusion It also enforces aspects of sequential consistency: Memory operations are not moved across lock or unlock calls by the compiler Hardware is made to ensure all writes are completed at lock() or unlock() But locks are expensive, because they cause serialization Still, for most practical situations, locks are fast enough Just use locks and avoid all the trouble, in practice Unless you are in a fine grained situation with many threads I.e., computation between consecutive calls to lock is very short Then, consider a wait-free implementation with atomics L.V.Kale 27 
- An Aside: Lock-Free Algorithms Early days of computer science, there were many research papers and textbook materials on lock-free algorithms Peterson s, Dekker s These algorithms all depended on sequential consistency, which processors of the day might have supported That is no longer true, and so those algorithms are mostly not useful May occasionally provide inspiration for a wait-free algorithm L.V.Kale 28 
- Example: Circular Fixed-Size Queues We will look at efficient implementation of shared queues Depending on sharing assumptions, one can make more efficient queues General: multiple producers/consumers Single producer/single consumer Multiple producers single consumer Steal queues: (popularized by Cilk) L.V.Kale 29 
- Circular Queues: Implementation Array of fixed size 2^n Masking of indices 3 0 1 2 1022 1023 1021 1022 1023 1 2 3 1021 1025 0 1024 Masking of Indices 1025: 10000000001 1: 0000000001 Masking L.V.Kale 30 
- Single Producer Single Consumer Queue We will look at fixed-size queue Not allowed to grow beyond a fixed limit Single producer accesses the tail Single consumer accesses the head No contention on head and tail Count number of elements in the queue: used to safeguard against empty and full conditions on the queue Three implementations Lockless Thread Unsafe Locking Thread Safe Lockless Thread Safe L.V.Kale 31 
- Single Producer Single Consumer: Lockless Thread Unsafe bool deq(T &out){ int ret=0; class SPSCQueue{ private: T* arr; int count; int head,tail; if(count>0){ count--; out=arr[mask(head++)]; ret=1; } return ret; } public: bool enq(T &data){ int ret=0; if(count<capacity){ count++; arr[mask(tail++)]=data; ret= 1;} Using the notion of 'count' to safeguard the queue But there is a race condition on the access of count Both producer and consumer can try to access 'count' at the same time Can lead to inconsistency return ret; } L.V.Kale 32 
- Single Producer Single Consumer: Locking Thread Safe class SPSCQueue{ private: T* arr; int count int head,tail; mutex mtx; bool deq(T &out){ int ret=0; mtx.lock(); if(count>0){ count--; out=arr[mask(head++)]; ret=1; } mtx.unlock(); return ret; } Using the notion of 'count' to safeguard the queue Once the mtx is acquired by a thread no other thread can acquire it before mtx.unlock() Other threads wait till the lock is released Locking and unlocking overheads are significant Note: always release lock before return statement public: bool enq(T &data){ int ret=0; mtx.lock(); if(count<capacity){ count++; arr[mask(tail++)]=data; ret= 1;} mtx.unlock(); return ret; } L.V.Kale 33 
- Single Producer Single Consumer: Wait-Free and Thread Safe bool deq(T &out){ if(count.load()>0){ out=arr[mask(head++)].load() count.fetch_add(-1); return 1;} return 0; } class SPSCQueue{ private: array<atomic<T>,capacity> arr; atomic<int> count; int head,tail; Make count atomic (accessed by producer and consumer) to prevent contention 2 atomic operations per enq or deq operation in the normal case public: bool enq(T &data){ if(count.load()<capacity){ arr[mask(tail++)].store(data); count.fetch_add(1); return 1;} return 0; } L.V.Kale 34 
- C++11 Atomics: Queues Multiple Producer Single Consumer Queue 
- Multiple Producer Single Consumer Queue Assume fixed size, power of 2, queue We will use the notion of an 'EMPTY' element A specific value denotes empty in the queue (say 1) Producer thread checks if a position contains EMPTY before inserting to it Consumer thread makes sure a position does not contain EMPTY before extracting a value from it After extracting the value it inserts EMPTY in that position L.V.Kale 36 
- Multiple Producers Single Consumer: Thread Unsafe class MPSC_Queue{ private: T* arr; int head; int tail; bool deq(T &out){ if(arr[mask(head)]==EMPTY)return 0; else{ out=arr[head]; arr[mask(head++)]=EMPTY; return 1; } } public: bool enq(T &data){ if(arr[mask(tail)]!=EMPTY)return 0; else{ arr[mask(tail++)]=data; return 1; } } There is a race condition on tail as result of multiple producers There is a race on each cell in arr as a producer and consumer thread access it, without synchronization No race on head: only 1 thread accesses it L.V.Kale 37 
- Multiple producers Single Consumer: Locking-Thread Safe bool deq(T &out){ int ret; mtx.lock(); if(arr[mask(head)]==EMPTY)ret=0; else{ out=arr[mask(head++)]; arr[mask(head++)]=EMPTY; ret=1;} mtx.unlock(); return ret; } class MPSC_Locking_Queue{ private: T* arr; int head,tail; mutex mtx; public: bool enq(T &data){ int ret; mtx.lock(); if(arr[mask(tail)]==EMPTY){ arr[mask(tail++)]=data; ret=1;} else ret=0; mtx.unlock(); return ret; } Once the mtx is acquired by a thread no other thread can acquire it before mtx.unlock() Other threads wait on the critical section till the lock is released Locking and unlocking overheads are significant Note: always release lock before return statement (ret helps with that) L.V.Kale 40 
- Multiple Producers Single Consumer: Thread Unsafe (1.0) class MPSC_Queue{ private: T* arr; int head int tail; public: bool enq(T &data){ if(arr[mask(tail)]==EMPTY){ arr[mask(tail++)]=data; bool deq(T &out){ out=arr[mask(head)]; if(out==EMPTY)return 0; else{ arr[mask(head++)]=EMPTY; return 1; } } We will modify the lockless thread unsafe version into a lockless thread safe version in 2 steps return 1; } else return 0; } L.V.Kale 42 
- Multiple producers Single Consumer: Step 1 class MPSC_Queue{ private: T* arr; int head; atomic<int> tail; public: bool enq(T &data){ int mytail=tail.fetch_add(1); if(arr[mask(mytail)]==EMPTY){ arr[mask(mytail)] = data; return 1;} else{ tail.fetch_add(-1); return 0;} } bool deq(T &out){ out=arr[mask(head)]; if(out==EMPTY)return 0; else{ arr[mask(head++)]=EMPTY; return 1; } } The previous version was vulnerable to conflicts between 2 producers Access to the tail was not protected Change the tail into an atomic variable Modify the operations accordingly (atomic::fetch_add) replaces post increment operation L.V.Kale 43 
- Multiple Producers Single Consumer: Step 2 class MPSC_Queue{ private: array<atomic<T>,capacity> arr; int head; atomic<int> tail; public: bool enq(T &data){ int mytail=tail.fetch_add(1); if(arr[mask(mytail)].load()==EMPTY) arr[mask(mytail)].store(data); return 1;} else{ tail.fetch_add(-1); return 0;} } bool deq(T &out){ out=arr[mask(head)].load(); if(out==EMPTY)return 0; else{ arr[mask(head++)].store(EMPTY); return 1; } } The previous version did not prevent a race between a producer and a consumer Make the underlying data structure of the queue an array of atomics Modify the operations accordingly Note: load() and store() calls are not compulsory on atomics. A simple assignment operation can do the same operation L.V.Kale 44 
- Multiple Producers Single Consumer: class MPSC_Queue{ private: array<atomic<T>,capacity> arr; int head; atomic<int> tail; public: bool enq(T &data){ int mytail=tail.fetch_add(1); if(arr[mask(mytail)].load()==EMPTY) arr[mask(mytail)].store(data); return 1;} else{ tail.fetch_add(-1); return 0;} } bool deq(T &out){ out=arr[mask(head)].load(); if(out==EMPTY)return 0; else{ arr[mask(head++)].store(EMPTY); return 1; } } Analysis: 3 Atomic Operations per enq in the normal case 3 Atomic Operations per enq if queue is full 2 Atomic Operations per deq in the normal case 1 Atomic Operations per deq if queue is empty L.V.Kale 45 
- Purple here denotes the Empty value P1 atomically fetches and increments the value of global tail C1 attempts dequeue and finds empty value (failure) P2 atomically fetches and increments the value of global tail P2 adds 15 to the queue P1 adds 7 to the queue C1 (non-atomically) fetches and increments the value of global head C1 attempts dequeue again and finds 7 (success) C1 attempts dequeue again and finds 15 (success) C1 C1Head Head C1Head Fail Head Head 7 15 Tail Tail P2Tail P1Tail Tail P2 P1 
- Not done yet! There is a bug Head is pointing to x and tail to x as well. The queue is full What if P1 comes in to enqueue, gets index x as the tail where it should insert, Now tail is x+1 (P1:mytail is x) Q[x] is not empty, P1 goes to the else clause, but P1 has not executed the else clause yet In the meanwhile, P2 comes to enqueue, also increments tail so it has x+1 as the place where it will insert (P2: mytail is x+1), global tail is x+2 Consumer comes to deque twice So head now becomes x+2, and Q[x] and Q[x+1] are both empty Next, P1 continues Decrements tail, so global tail is now x+1 Returns with failure (returns 0). Now, we have a hole in the queue! No data at x, but P2 s data is at x+1 Consumer will never go past the hole, P2 s data will get overwritten by the next enqueue L.V.Kale 48 
- Tail is 2, Head is 2, Queue is full Process P1 comes to enque P1.mytail == 2 tail == 3 P1 discovers arr[2] is not empty But before it could execute the else P2 comes to enque (say, the value 90) P1.mytail == 3 tail == 4 But before it executes the if.. Consumer C dequeues twice Dequeues 82 and 43 ; replaces them with empty values Now P2 executes the if, finds arr[3] empty, and enqueues its value there P1 executes else, decrements tail, and leaves an empty value (hole) at arr[2]! bool enq(T &data){ int mytail=tail.fetch_add(1); if(arr[mask(mytail)].load()==EMPTY){ arr[mask(mytail)].store(data); return 1;} else{ tail.fetch_add(-1); return 0;} } Head 0 1 4 5 6 7 2 3 76 12 82 43 90 25 31 69 52 Tail L.V.Kale 49 
- The problem in a nutshell The problem was that P1 reserved a spot in the q ueue, but gave up and returned A process must enqueue in a spot it reserves, even if it has to wait L.V.Kale 50 
- Multiple Producers Single Consumer: LockLess Corrected class MPSC_Lockless_Queue{ private: array<atomic<T>,capacity> arr; int head; atomic<int> tail; public: bool enq(T &data){ int mytail=tail.fetch_add(1); while (arr[mask(mytail)].load()!=EMPTY) ; // eventually consumer will empty it arr[mask(mytail)].store(data); return 1; } bool deq(T &out){ out=arr[mask(head)].load(); if(out==EMPTY)return 0; else{ arr[mask(head++)].store(EMPTY); return 1; } } L.V.Kale 51 
- Wait-free? Now, of course, our queue is not wait-free in a strict sense because of the while loop You could Argue that the queue-size must be chosen such that it never becomes full Say that its wait-free except when the queue is full OR: rewrite the code, so that you remember the value we need to enque in a private data structure (queue?) and return with failure, but a process which gets a failure on enqueue must call again to try to enqueue Use a overflow queue protected by lock?? Does the FIFO ordering matter? L.V.Kale 52 
- Cilk A Task Based Parallel Language 
- Cilk Language Developed over 15+ years Two calls added to C: Popularized the idea of work stealing But the idea is older: MultiLisp (Halstead) State-space search: Vipin Kumar Formalized the idea of work stealing Proofs on optimality / time and space complexity Intel s Cilk++ : via Cilk Arts L.V.Kale 54 
- cilk int fib (int n) { if (n < 2) return n; else { } Only 2 keywords added to C Program semantics is the same as if you delete the 2 (red) keywords from the program int x, y; x = spawn fib (n-1); y = spawn fib (n-2); sync; return (x+y); } Example from: http://supertech.lcs.mit.edu/cilk/intro.html L.V.Kale 55 
- Possible Implementations User level threads and suspensions Task scheduling (who does what when) Centralized queue: 1 queue for every core, and assign work to one of them randomly Load imbalance: some core may get too much Extension: balance queues periodically 1 queue for every core, keep work in your queue, initially Periodic balancing Alternative: idle processor de-queues work from someone else s queue: This is called work stealing good balance, scalablity?, overhead (locks for queues) L.V.Kale 56 
