Exploring Parallelism in Standard C++ Programming

parallelism in the standard c what to expect n.w
1 / 41
Embed
Share

Discover the world of parallelism in Standard C++ programming with insights on task regions, parallel algorithms, vectorization, and fundamentals like memory models, atomics, and more. Follow a journey through various parallelism techniques and implementations, including discussions on OpenMP, MPI, CUDA, and others. Dive into examples like Quicksort optimization using threads and task regions. Learn about work stealing and under-the-hood details of parallel programming in C++.

  • C++
  • Parallelism
  • Standard
  • Programming
  • Algorithms

Uploaded on | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Parallelism in the Standard C++: What to Expect in C++ 17 Artur Laksberg Microsoft Corp. May 8th, 2014

  2. Agenda Fundamentals Task regions Parallel Algorithms Parallelization Vectorization

  3. Part 1: The Fundamentals

  4. Renderscript OpenMP PPL MPI CUDA C++ AMP TBB OpenACC Cilk Plus OpenCL GCD

  5. Parallelism in C++11/14 Fundamentals: Basics: Memory model thread Atomics mutex condition_variable async future

  6. Quicksort: Serial void quicksort(int *v, int start, int end) { if (start < end) { int pivot = partition(v, start, end); quicksort(v, start, pivot - 1); quicksort(v, pivot + 1, end); } }

  7. Quicksort: Use Threads Problem 1: expensive void quicksort(int *v, int start, int end) { if (start < end) { int pivot = partition(v, start, end); std::thread t1([&] { quicksort(v, start, pivot - 1); }); Problem 3: Exceptions?? std::thread t2([&] { quicksort(v, pivot + 1, end); }); Problem 2: Fork-join not enforced t1.join(); t2.join(); } }

  8. Quicksort: Fork-Join Parallelism parallel region void quicksort(int *v, int start, int end) { if (start < end) { int pivot = partition(v, start, end); task quicksort(v, start, pivot - 1); task quicksort(v, pivot + 1, end); } }

  9. Quicksort: Using Task Regions (N3832) parallel region void quicksort(int *v, int start, int end) { if (start < end) { task_region([&] (auto& r) { int pivot = partition(v, start, end); task r.run([&] { quicksort(v, start, pivot - 1); }); task r.run([&] { quicksort(v, pivot + 1, end); }); }); } }

  10. Under The Hood

  11. Work Stealing Scheduling proc 2 proc 1 proc 3 proc 4

  12. Work Stealing Scheduling proc 2 proc 1 proc 3 proc 4 Old items New items

  13. Work Stealing Scheduling proc 2 proc 1 proc 3 proc 4 Old items New items

  14. Work Stealing Scheduling proc 2 proc 1 proc 3 proc 4 Old items New items Thief

  15. Fork-Join Parallelism and Work Stealing Q1: What thread runs f? Q2: What thread runs g? e(); e() task_region([] (auto& r) { r.run(f); g() f() g(); }); Q3: What thread runs h? h(); h()

  16. Work Stealing Design Choices What Thread Executes After a Spawn? What Thread Executes After a Join? Child Stealing Stalling: initiating thread waits Continuation (parent) Stealing Greedy: the last thread to reach join continues task_region([] (auto& r) { for(int i=0; i<n; ++i) r.run(f); });

  17. Part 2: The Algorithms

  18. Alex Stepanov: Start With The Algorithms

  19. Inspiration Intel Performing Parallel Operations On Containers Threading Building Blocks Microsoft Parallel Patterns Library, C++ AMP Nvidia Thrust

  20. Parallel STL Just like STL, only parallel Can be faster If you know what you re doing Two Execution Policies: std:par std::vec

  21. Parallelization: Whats a Big Deal? Why not already parallel? std::sort(begin, end, [](int a, int b) { return a < b; }); User-provided closures must be thread safe: int comparisons = 0; std::sort(begin, end, [&](int a, int b) { comparisons++; return a < b; }); But also special-member functions, std::swap etc.

  22. Its a Contract What the user can do What the implementer can do Asymptotic Guarantees: std::sort: O(n*log(n)), std::stable_sort: O(n*log2(n)), what about parallel sort? What is a valid implementation? (see next slide)

  23. Chaos Sort template<typename Iterator, typename Compare> void chaos_sort( Iterator first, Iterator last, Compare comp ) { auto n = last-first; std::vector<char> c(n); for(;;) { bool flag = false; for( size_t i=1; i<n; ++i ) { c[i] = comp(first[i],first[i-1]); flag |= c[i]; } if( !flag ) break; for( size_t i=1; i<n; ++i ) if( c[i] ) std::swap( first[i-1], first[i] ); } }

  24. Execution Policies Built-in Execution Policies: extern const sequential_execution_policy seq; extern const parallel_execution_policy par; extern const vector_execution_policy vec; Dynamic Execution Policy: class execution_policy { public: // ... const type_info& target_type() const; template<class T> T *target(); template<class T> const T *target() const; };

  25. Using Execution Policy To Write Paralel Code std::vector<int> vec = ... // standard sequential sort std::sort(vec.begin(), vec.end()); using namespace std::experimental::parallel; // explicitly sequential sort sort(seq, vec.begin(), vec.end()); // permitting parallel execution sort(par, vec.begin(), vec.end()); // permitting vectorization as well sort(vec, vec.begin(), vec.end());

  26. Picking Execution Policy Dynamically size_t threshold = ... execution_policy exec = seq; if(vec.size() > threshold) { exec = par; } sort(exec, vec.begin(), vec.end());

  27. Exception Handling In C++ philosophy, no exception is silently ignored Exception list: container of exception_ptr objects try { r = std::inner_product(std::par, a.begin(), a.end(), b.begin(), func1, func2, 0); } catch(const exception_list& list) { for(auto& exptr : list) { // process exception pointer exptr } }

  28. Vectorization: A Tale From Agriculture

  29. A Tale From Agriculture

  30. A Tale From Agriculture

  31. Idea: Fewer Tractors, Wider Plows

  32. Vectorization: Whats a Big Deal? int a[n] = ...; int b[n] = ...; for(int i=0; i<n; ++i) { a[i] = b[i] + c; } movdqu xmm1, XMMWORD PTR _b$[esp+eax+132] movdqu xmm0, XMMWORD PTR _a$[esp+eax+132] paddd xmm1, xmm2 paddd xmm1, xmm0 movdqu XMMWORD PTR _a$[esp+eax+132], xmm1 a[i:i+3] = b[i:i+3] + c;

  33. Vector Lane is not a Thread! Taking locks Thread with thread_id x takes a lock Then another thread with the same thread_id enters the lock Deadlock!!! Exceptions Can we unwind 1/4thof the stack?

  34. Vectorization: Not So Easy Any More Aliasing? void f(int* a, int*b) { for(int i=0; i<n; ++i) { a[i] = b[i] + c; func(); mov add add call ecx, DWORD PTR _b$[esp+esi+140] ecx, edi DWORD PTR _a$[esp+esi+140], ecx func } } Side effects? Dependence? Exceptions?

  35. Vectorization Hazard: Locks Consider: f takes a lock, g releases the lock: ? for(int i=0; i<n; ++i) { lock.enter(); a[i] = b[i] + c; lock.release(); } for(int i=0; i<n; i+=4) { for(int j=0; j<4; ++j) lock.enter(); a[i:i+3] = b[i:i+3] + c; for(int j=0; j<4; ++j) lock.release(); } This transformation is not safe!

  36. How Do We Get This? void f(int* a, int*b) { for(int i=0; i<n; ++i) { a[i] = b[i] + c; func(); } for(int i=0; i<n; i+=4) { a[i:i+3] = b[i:i+3] + c; for(int j=0; j<4; ++j) func(); } } Need a helping hand from the programmer, because

  37. Vector Loop with Parallel STL void f(int* a, int*b) { integer_iterator begin {0}; integer_iterator end {n}; { a[i] = b[i] + c; func(); } } std::for_each( std::vec, begin, end, [&](int i)

  38. Parallelization vs. Vectorization Parallelization Vectorization Threads Vector Lanes Stack No stack Good for divergent code Lock-step execution Relatively heavy-weight Very light-weight

  39. When To Vectorize std::par std::vec No race conditions Same as std::vec, plus: No aliasing No Exceptions No Locks No/Little Divergence

  40. References N3832: Task Region N3872: A Primer on Scheduling Fork-Join Parallelism with Work Stealing N3724: A Parallel Algorithms Library N3850: Working Draft, Technical Specification for C++ Extensions for Parallelism parallelstl.codeplex.com

More Related Content