
Exploring Parallelism in Standard C++ Programming
Discover the world of parallelism in Standard C++ programming with insights on task regions, parallel algorithms, vectorization, and fundamentals like memory models, atomics, and more. Follow a journey through various parallelism techniques and implementations, including discussions on OpenMP, MPI, CUDA, and others. Dive into examples like Quicksort optimization using threads and task regions. Learn about work stealing and under-the-hood details of parallel programming in C++.
Presentation Transcript
Parallelism in the Standard C++: What to Expect in C++17. Artur Laksberg, Microsoft Corp. May 8th, 2014
Agenda: Fundamentals; Task Regions; Parallel Algorithms (Parallelization, Vectorization)
Renderscript, OpenMP, PPL, MPI, CUDA, C++ AMP, TBB, OpenACC, Cilk Plus, OpenCL, GCD
Parallelism in C++11/14. Fundamentals and basics: memory model, thread, atomics, mutex, condition_variable, async, future.
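A minimal sketch of the C++11/14 building blocks listed above (thread, mutex, atomics, async/future); the worker count and printed values are illustrative, not from the talk:

#include <atomic>
#include <future>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> counter{0};     // lock-free shared counter
    std::mutex io_mutex;             // protects std::cout

    // std::thread: explicit fork and join
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([&, i] {
            counter.fetch_add(1, std::memory_order_relaxed);
            std::lock_guard<std::mutex> lock(io_mutex);
            std::cout << "worker " << i << " done\n";
        });
    }
    for (auto& t : workers) t.join();

    // std::async / std::future: the result (or an exception) is delivered through the future
    std::future<int> f = std::async(std::launch::async, [] { return 42; });
    std::cout << "counter = " << counter << ", async result = " << f.get() << "\n";
}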
Quicksort: Serial

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}
Quicksort: Use Threads

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        std::thread t1([&] { quicksort(v, start, pivot - 1); });
        std::thread t2([&] { quicksort(v, pivot + 1, end); });
        t1.join();
        t2.join();
    }
}

Problem 1: threads are expensive. Problem 2: fork-join is not enforced. Problem 3: exceptions?
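One way to blunt all three problems with raw std::thread is std::async plus std::future: the call to get() enforces the join, and an exception thrown in the child is captured and re-thrown there instead of terminating the program. This is a hedged sketch, not code from the talk; the depth cutoff, the Lomuto partition, and the names quicksort_async/depth are illustrative assumptions:

#include <future>
#include <utility>

// A Lomuto-style partition, standing in for the partition() used on the slides.
int partition(int *v, int start, int end) {
    int pivot = v[end];
    int i = start - 1;
    for (int j = start; j < end; ++j)
        if (v[j] < pivot)
            std::swap(v[++i], v[j]);
    std::swap(v[i + 1], v[end]);
    return i + 1;
}

void quicksort_async(int *v, int start, int end, int depth = 0) {
    if (start >= end) return;
    int pivot = partition(v, start, end);
    if (depth > 4) {                                    // illustrative cutoff: deep levels go serial
        quicksort_async(v, start, pivot - 1, depth + 1);
        quicksort_async(v, pivot + 1, end, depth + 1);
        return;
    }
    auto left = std::async(std::launch::async, [=] {    // fork: left half on another thread
        quicksort_async(v, start, pivot - 1, depth + 1);
    });
    quicksort_async(v, pivot + 1, end, depth + 1);      // right half on this thread
    left.get();                                         // join; re-throws any exception from the child
}

Each std::async here may still create a fresh thread, so the "expensive" problem is only reduced by the cutoff; that is the gap task regions and a work-stealing scheduler are meant to close.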
Quicksort: Fork-Join Parallelism (pseudocode; "task" marks spawned work inside the parallel region)

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        task quicksort(v, start, pivot - 1);
        task quicksort(v, pivot + 1, end);
    }
}
Quicksort: Using Task Regions (N3832)

void quicksort(int *v, int start, int end) {
    if (start < end) {
        task_region([&](auto& r) {       // the task_region is the parallel region
            int pivot = partition(v, start, end);
            r.run([&] { quicksort(v, start, pivot - 1); });   // spawned task
            r.run([&] { quicksort(v, pivot + 1, end); });     // spawned task
        });                              // implicit join: the region waits for both tasks
    }
}
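Two refinements are commonly layered on this pattern: a grain-size cutoff so small ranges are sorted serially instead of spawning tasks, and running one of the two halves inline on the spawning thread. A hedged sketch against the N3832 interface (task_region was never standardized in this exact form and will not compile against a shipping library; serial_quicksort, the 1000-element threshold, and the inline half are illustrative assumptions):

void quicksort(int *v, int start, int end) {
    if (start >= end) return;
    if (end - start < 1000) {            // small range: spawning would cost more than it saves
        serial_quicksort(v, start, end);
        return;
    }
    int pivot = partition(v, start, end);
    task_region([&](auto& r) {
        r.run([&] { quicksort(v, start, pivot - 1); });   // spawned child task
        quicksort(v, pivot + 1, end);                     // other half runs on this thread
    });                                                   // implicit join: both halves done here
}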
Work Stealing Scheduling

[Diagram sequence: four workers (proc 1, proc 2, proc 3, proc 4), each with its own double-ended task queue. The owning worker pushes and pops new items at one end; old items accumulate at the other end. When a worker runs out of work it becomes a thief and steals old items from another worker's queue.]
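To make the pictures concrete, here is a toy, mutex-based version of the per-worker deque the diagrams describe; production schedulers (TBB, PPL, Cilk) use lock-free deques, and all names here are illustrative:

#include <deque>
#include <functional>
#include <mutex>

// Toy work-stealing deque: the owner works at the back, thieves steal from the front.
class work_deque {
    std::deque<std::function<void()>> tasks_;
    std::mutex m_;
public:
    void push(std::function<void()> t) {        // owner: new items go to the back
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push_back(std::move(t));
    }
    bool pop(std::function<void()>& t) {        // owner: take the newest item (LIFO, cache-warm)
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return false;
        t = std::move(tasks_.back());
        tasks_.pop_back();
        return true;
    }
    bool steal(std::function<void()>& t) {      // thief: take the oldest item (FIFO)
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return false;
        t = std::move(tasks_.front());
        tasks_.pop_front();
        return true;
    }
};

An idle worker first drains its own deque with pop(); once it is empty, it picks a victim and calls steal(). Taking old items from the far end leaves the owner's freshly pushed, cache-warm work alone and tends to hand the thief the largest remaining chunks.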
Fork-Join Parallelism and Work Stealing

e();
task_region([] (auto& r) {
    r.run(f);
    g();
});
h();

Q1: What thread runs f? Q2: What thread runs g? Q3: What thread runs h?
Work Stealing Design Choices

What thread executes after a spawn? Child stealing, or continuation (parent) stealing.
What thread executes after a join? Stalling: the initiating thread waits. Greedy: the last thread to reach the join continues.

task_region([] (auto& r) {
    for (int i = 0; i < n; ++i)
        r.run(f);
});
Inspiration: performing parallel operations on containers. Intel: Threading Building Blocks. Microsoft: Parallel Patterns Library, C++ AMP. Nvidia: Thrust.
Parallel STL: just like STL, only parallel. Can be faster, if you know what you're doing. Two execution policies: std::par and std::vec.
Parallelization: What's the Big Deal?

Why isn't this already parallel?

std::sort(begin, end, [](int a, int b) {
    return a < b;
});

User-provided closures must be thread safe:

int comparisons = 0;
std::sort(begin, end, [&](int a, int b) {
    comparisons++;          // data race once the sort runs on multiple threads
    return a < b;
});

But the same applies to special member functions, std::swap, etc.
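If the closure genuinely needs shared state, the user must make that state thread safe before asking for a parallel policy. A minimal sketch of one fix for the counting comparator above, using an atomic counter; the namespace and call shape follow the proposal as used on the later slides, and count_and_sort is an illustrative name:

#include <atomic>
#include <vector>

std::atomic<int> comparisons{0};

void count_and_sort(std::vector<int>& v) {
    using namespace std::experimental::parallel;   // proposal namespace, as on the later slides
    sort(par, v.begin(), v.end(), [](int a, int b) {
        comparisons.fetch_add(1, std::memory_order_relaxed);   // safe under concurrent calls
        return a < b;
    });
}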
It's a Contract: what the user can do, and what the implementer can do. Asymptotic guarantees: std::sort is O(n log n), std::stable_sort is O(n log^2 n). What about a parallel sort? What is a valid implementation? (See next slide.)
Chaos Sort

template<typename Iterator, typename Compare>
void chaos_sort(Iterator first, Iterator last, Compare comp) {
    auto n = last - first;
    std::vector<char> c(n);
    for (;;) {
        bool flag = false;
        for (size_t i = 1; i < n; ++i) {
            c[i] = comp(first[i], first[i-1]);   // mark adjacent pairs that are out of order
            flag |= c[i];
        }
        if (!flag) break;                        // no inversions left: sorted
        for (size_t i = 1; i < n; ++i)
            if (c[i])
                std::swap(first[i-1], first[i]);
    }
}
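A quick usage sketch (main and the sample data are illustrative): chaos_sort does produce a sorted sequence, but it can take on the order of n sweeps of n comparisons each, i.e. O(n^2) work, which is exactly the question the contract slide raises about what a parallel sort is allowed to do.

#include <algorithm>
#include <cassert>
#include <vector>

int main() {
    std::vector<int> v = {5, 3, 1, 4, 2};
    chaos_sort(v.begin(), v.end(), [](int a, int b) { return a < b; });   // definition above
    assert(std::is_sorted(v.begin(), v.end()));   // sorted, just not in O(n log n) comparisons
}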
Execution Policies

Built-in execution policies:

extern const sequential_execution_policy seq;
extern const parallel_execution_policy par;
extern const vector_execution_policy vec;

Dynamic execution policy:

class execution_policy {
public:
    // ...
    const type_info& target_type() const;
    template<class T> T *target();
    template<class T> const T *target() const;
};
Using Execution Policies to Write Parallel Code

std::vector<int> data = ...;

// standard sequential sort
std::sort(data.begin(), data.end());

using namespace std::experimental::parallel;

// explicitly sequential sort
sort(seq, data.begin(), data.end());

// permitting parallel execution
sort(par, data.begin(), data.end());

// permitting vectorization as well
sort(vec, data.begin(), data.end());
Picking an Execution Policy Dynamically

size_t threshold = ...;
execution_policy exec = seq;
if (data.size() > threshold) {
    exec = par;
}
sort(exec, data.begin(), data.end());
Exception Handling

In the C++ philosophy, no exception is silently ignored. exception_list: a container of exception_ptr objects.

try {
    r = std::inner_product(std::par, a.begin(), a.end(), b.begin(),
                           0, func1, func2);
} catch (const exception_list& list) {
    for (auto& exptr : list) {
        // process the exception_ptr exptr
    }
}
Vectorization: What's the Big Deal?

int a[n] = ...;
int b[n] = ...;

for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c;
}

The compiler can vectorize this loop into SSE code, four elements at a time:

movdqu  xmm1, XMMWORD PTR _b$[esp+eax+132]
movdqu  xmm0, XMMWORD PTR _a$[esp+eax+132]
paddd   xmm1, xmm2
paddd   xmm1, xmm0
movdqu  XMMWORD PTR _a$[esp+eax+132], xmm1

In array notation: a[i:i+3] = b[i:i+3] + c;
Vector Lane Is Not a Thread!

Taking locks: a thread with thread_id x takes a lock; then another "thread" with the same thread_id enters the lock. Deadlock!

Exceptions: can we unwind 1/4th of the stack?
Vectorization: Not So Easy Any More

void f(int* a, int* b) {
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c;
        func();
    }
}

The compiler generates scalar code, one element per iteration:

mov   ecx, DWORD PTR _b$[esp+esi+140]
add   ecx, edi
add   DWORD PTR _a$[esp+esi+140], ecx
call  func

Aliasing? Side effects? Dependence? Exceptions?
Vectorization Hazard: Locks

Consider: f takes a lock, g releases the lock. The scalar loop takes and releases the lock once per element:

for (int i = 0; i < n; ++i) {
    lock.enter();
    a[i] = b[i] + c;
    lock.release();
}

A naive vectorization takes the lock four times before touching the data:

for (int i = 0; i < n; i += 4) {
    for (int j = 0; j < 4; ++j) lock.enter();
    a[i:i+3] = b[i:i+3] + c;
    for (int j = 0; j < 4; ++j) lock.release();
}

This transformation is not safe!
How Do We Get This?

void f(int* a, int* b) {
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c;
        func();
    }
}

Desired: vectorize the arithmetic, keep the call scalar:

for (int i = 0; i < n; i += 4) {
    a[i:i+3] = b[i:i+3] + c;
    for (int j = 0; j < 4; ++j) func();
}

We need a helping hand from the programmer, because...
Vector Loop with Parallel STL

void f(int* a, int* b) {
    integer_iterator begin {0};
    integer_iterator end {n};
    std::for_each(std::vec, begin, end, [&](int i) {
        a[i] = b[i] + c;
        func();
    });
}
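integer_iterator is not a standard type; the intent is a counting iterator whose value_type is the index itself, so for_each over [0, n) hands the lambda plain ints. A minimal sketch of such an iterator, enough for a forward traversal (a production version, like boost::counting_iterator, would model a full random access iterator so the implementation can split and vectorize the range):

#include <cstddef>
#include <iterator>

// Minimal counting iterator: dereferencing yields the current index.
class integer_iterator {
    int i_;
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type        = int;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const int*;
    using reference         = int;

    explicit integer_iterator(int i) : i_(i) {}
    int operator*() const { return i_; }
    integer_iterator& operator++() { ++i_; return *this; }
    integer_iterator operator++(int) { auto tmp = *this; ++i_; return tmp; }
    bool operator==(const integer_iterator& other) const { return i_ == other.i_; }
    bool operator!=(const integer_iterator& other) const { return i_ != other.i_; }
};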
Parallelization vs. Vectorization

Parallelization: threads; each has a stack; good for divergent code; relatively heavy-weight.
Vectorization: vector lanes; no stack; lock-step execution; very light-weight.
When to Vectorize

std::par: no race conditions.
std::vec: same as std::par, plus: no aliasing, no exceptions, no locks, no/little divergence.
References

N3832: Task Region
N3872: A Primer on Scheduling Fork-Join Parallelism with Work Stealing
N3724: A Parallel Algorithms Library
N3850: Working Draft, Technical Specification for C++ Extensions for Parallelism
parallelstl.codeplex.com