
Exploring Parallelism in Standard C++ Programming
Discover the world of parallelism in Standard C++ programming with insights on task regions, parallel algorithms, vectorization, and fundamentals like memory models, atomics, and more. Follow a journey through various parallelism techniques and implementations, including discussions on OpenMP, MPI, CUDA, and others. Dive into examples like Quicksort optimization using threads and task regions. Learn about work stealing and under-the-hood details of parallel programming in C++.
Presentation Transcript
Parallelism in the Standard C++: What to Expect in C++17. Artur Laksberg, Microsoft Corp. May 8th, 2014
Agenda: Fundamentals; Task Regions; Parallel Algorithms (Parallelization, Vectorization)
Renderscript, OpenMP, PPL, MPI, CUDA, C++ AMP, TBB, OpenACC, Cilk Plus, OpenCL, GCD
Parallelism in C++11/14. Fundamentals and basics: memory model, thread, atomics, mutex, condition_variable, async, future.
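A minimal sketch of the C++11/14 building blocks listed above (thread, mutex, atomics, async/future); the worker count and printed values are illustrative, not from the talk:

#include <atomic>
#include <future>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> counter{0};     // lock-free shared counter
    std::mutex io_mutex;             // protects std::cout

    // std::thread: explicit fork and join
    std::vector<std::thread> workers;
    for (int i = 0; i < 4; ++i) {
        workers.emplace_back([&, i] {
            counter.fetch_add(1, std::memory_order_relaxed);
            std::lock_guard<std::mutex> lock(io_mutex);
            std::cout << "worker " << i << " done\n";
        });
    }
    for (auto& t : workers) t.join();

    // std::async / std::future: the result (or an exception) is delivered through the future
    std::future<int> f = std::async(std::launch::async, [] { return 42; });
    std::cout << "counter = " << counter << ", async result = " << f.get() << "\n";
}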
Quicksort: Serial

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        quicksort(v, start, pivot - 1);
        quicksort(v, pivot + 1, end);
    }
}
Quicksort: Use Threads

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        std::thread t1([&] { quicksort(v, start, pivot - 1); });
        std::thread t2([&] { quicksort(v, pivot + 1, end); });
        t1.join();
        t2.join();
    }
}

Problem 1: threads are expensive. Problem 2: fork-join is not enforced. Problem 3: exceptions?
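One way to blunt all three problems with raw std::thread is std::async plus std::future: the call to get() enforces the join, and an exception thrown in the child is captured and re-thrown there instead of terminating the program. This is a hedged sketch, not code from the talk; the depth cutoff, the Lomuto partition, and the names quicksort_async/depth are illustrative assumptions:

#include <future>
#include <utility>

// A Lomuto-style partition, standing in for the partition() used on the slides.
int partition(int *v, int start, int end) {
    int pivot = v[end];
    int i = start - 1;
    for (int j = start; j < end; ++j)
        if (v[j] < pivot)
            std::swap(v[++i], v[j]);
    std::swap(v[i + 1], v[end]);
    return i + 1;
}

void quicksort_async(int *v, int start, int end, int depth = 0) {
    if (start >= end) return;
    int pivot = partition(v, start, end);
    if (depth > 4) {                                    // illustrative cutoff: deep levels go serial
        quicksort_async(v, start, pivot - 1, depth + 1);
        quicksort_async(v, pivot + 1, end, depth + 1);
        return;
    }
    auto left = std::async(std::launch::async, [=] {    // fork: left half on another thread
        quicksort_async(v, start, pivot - 1, depth + 1);
    });
    quicksort_async(v, pivot + 1, end, depth + 1);      // right half on this thread
    left.get();                                         // join; re-throws any exception from the child
}

Each std::async here may still create a fresh thread, so the "expensive" problem is only reduced by the cutoff; that is the gap task regions and a work-stealing scheduler are meant to close.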
Quicksort: Fork-Join Parallelism (pseudocode; "task" marks spawned work inside the parallel region)

void quicksort(int *v, int start, int end) {
    if (start < end) {
        int pivot = partition(v, start, end);
        task quicksort(v, start, pivot - 1);
        task quicksort(v, pivot + 1, end);
    }
}
Quicksort: Using Task Regions (N3832)

void quicksort(int *v, int start, int end) {
    if (start < end) {
        task_region([&](auto& r) {       // the task_region is the parallel region
            int pivot = partition(v, start, end);
            r.run([&] { quicksort(v, start, pivot - 1); });   // spawned task
            r.run([&] { quicksort(v, pivot + 1, end); });     // spawned task
        });                              // implicit join: the region waits for both tasks
    }
}
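Two refinements are commonly layered on this pattern: a grain-size cutoff so small ranges are sorted serially instead of spawning tasks, and running one of the two halves inline on the spawning thread. A hedged sketch against the N3832 interface (task_region was never standardized in this exact form and will not compile against a shipping library; serial_quicksort, the 1000-element threshold, and the inline half are illustrative assumptions):

void quicksort(int *v, int start, int end) {
    if (start >= end) return;
    if (end - start < 1000) {            // small range: spawning would cost more than it saves
        serial_quicksort(v, start, end);
        return;
    }
    int pivot = partition(v, start, end);
    task_region([&](auto& r) {
        r.run([&] { quicksort(v, start, pivot - 1); });   // spawned child task
        quicksort(v, pivot + 1, end);                     // other half runs on this thread
    });                                                   // implicit join: both halves done here
}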
Work Stealing Scheduling

[Diagram sequence: four workers (proc 1, proc 2, proc 3, proc 4), each with its own double-ended task queue. The owning worker pushes and pops new items at one end; old items accumulate at the other end. When a worker runs out of work it becomes a thief and steals old items from another worker's queue.]
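To make the pictures concrete, here is a toy, mutex-based version of the per-worker deque the diagrams describe; production schedulers (TBB, PPL, Cilk) use lock-free deques, and all names here are illustrative:

#include <deque>
#include <functional>
#include <mutex>

// Toy work-stealing deque: the owner works at the back, thieves steal from the front.
class work_deque {
    std::deque<std::function<void()>> tasks_;
    std::mutex m_;
public:
    void push(std::function<void()> t) {        // owner: new items go to the back
        std::lock_guard<std::mutex> lock(m_);
        tasks_.push_back(std::move(t));
    }
    bool pop(std::function<void()>& t) {        // owner: take the newest item (LIFO, cache-warm)
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return false;
        t = std::move(tasks_.back());
        tasks_.pop_back();
        return true;
    }
    bool steal(std::function<void()>& t) {      // thief: take the oldest item (FIFO)
        std::lock_guard<std::mutex> lock(m_);
        if (tasks_.empty()) return false;
        t = std::move(tasks_.front());
        tasks_.pop_front();
        return true;
    }
};

An idle worker first drains its own deque with pop(); once it is empty, it picks a victim and calls steal(). Taking old items from the far end leaves the owner's freshly pushed, cache-warm work alone and tends to hand the thief the largest remaining chunks.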
Fork-Join Parallelism and Work Stealing

e();
task_region([] (auto& r) {
    r.run(f);
    g();
});
h();

Q1: What thread runs f? Q2: What thread runs g? Q3: What thread runs h?
Work Stealing Design Choices

What thread executes after a spawn? Child stealing, or continuation (parent) stealing.
What thread executes after a join? Stalling: the initiating thread waits. Greedy: the last thread to reach the join continues.

task_region([] (auto& r) {
    for (int i = 0; i < n; ++i)
        r.run(f);
});
Inspiration: performing parallel operations on containers. Intel: Threading Building Blocks. Microsoft: Parallel Patterns Library, C++ AMP. Nvidia: Thrust.
Parallel STL: just like STL, only parallel. Can be faster, if you know what you're doing. Two execution policies: std::par and std::vec.
Parallelization: What's the Big Deal?

Why isn't this already parallel?

std::sort(begin, end, [](int a, int b) {
    return a < b;
});

User-provided closures must be thread safe:

int comparisons = 0;
std::sort(begin, end, [&](int a, int b) {
    comparisons++;          // data race once the sort runs on multiple threads
    return a < b;
});

But the same applies to special member functions, std::swap, etc.
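If the closure genuinely needs shared state, the user must make that state thread safe before asking for a parallel policy. A minimal sketch of one fix for the counting comparator above, using an atomic counter; the namespace and call shape follow the proposal as used on the later slides, and count_and_sort is an illustrative name:

#include <atomic>
#include <vector>

std::atomic<int> comparisons{0};

void count_and_sort(std::vector<int>& v) {
    using namespace std::experimental::parallel;   // proposal namespace, as on the later slides
    sort(par, v.begin(), v.end(), [](int a, int b) {
        comparisons.fetch_add(1, std::memory_order_relaxed);   // safe under concurrent calls
        return a < b;
    });
}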
It's a Contract: what the user can do, and what the implementer can do. Asymptotic guarantees: std::sort is O(n log n), std::stable_sort is O(n log^2 n). What about a parallel sort? What is a valid implementation? (See next slide.)
Chaos Sort

template<typename Iterator, typename Compare>
void chaos_sort(Iterator first, Iterator last, Compare comp) {
    auto n = last - first;
    std::vector<char> c(n);
    for (;;) {
        bool flag = false;
        for (size_t i = 1; i < n; ++i) {
            c[i] = comp(first[i], first[i-1]);   // mark adjacent pairs that are out of order
            flag |= c[i];
        }
        if (!flag) break;                        // no inversions left: sorted
        for (size_t i = 1; i < n; ++i)
            if (c[i])
                std::swap(first[i-1], first[i]);
    }
}
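A quick usage sketch (main and the sample data are illustrative): chaos_sort does produce a sorted sequence, but it can take on the order of n sweeps of n comparisons each, i.e. O(n^2) work, which is exactly the question the contract slide raises about what a parallel sort is allowed to do.

#include <algorithm>
#include <cassert>
#include <vector>

int main() {
    std::vector<int> v = {5, 3, 1, 4, 2};
    chaos_sort(v.begin(), v.end(), [](int a, int b) { return a < b; });   // definition above
    assert(std::is_sorted(v.begin(), v.end()));   // sorted, just not in O(n log n) comparisons
}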
Execution Policies

Built-in execution policies:

extern const sequential_execution_policy seq;
extern const parallel_execution_policy par;
extern const vector_execution_policy vec;

Dynamic execution policy:

class execution_policy {
public:
    // ...
    const type_info& target_type() const;
    template<class T> T *target();
    template<class T> const T *target() const;
};
Using Execution Policies to Write Parallel Code

std::vector<int> data = ...;

// standard sequential sort
std::sort(data.begin(), data.end());

using namespace std::experimental::parallel;

// explicitly sequential sort
sort(seq, data.begin(), data.end());

// permitting parallel execution
sort(par, data.begin(), data.end());

// permitting vectorization as well
sort(vec, data.begin(), data.end());
Picking an Execution Policy Dynamically

size_t threshold = ...;
execution_policy exec = seq;
if (data.size() > threshold) {
    exec = par;
}
sort(exec, data.begin(), data.end());
Exception Handling

In the C++ philosophy, no exception is silently ignored. exception_list: a container of exception_ptr objects.

try {
    r = std::inner_product(std::par, a.begin(), a.end(), b.begin(),
                           0, func1, func2);
} catch (const exception_list& list) {
    for (auto& exptr : list) {
        // process the exception_ptr exptr
    }
}
Vectorization: What's the Big Deal?

int a[n] = ...;
int b[n] = ...;

for (int i = 0; i < n; ++i) {
    a[i] = b[i] + c;
}

The compiler can vectorize this loop into SSE code, four elements at a time:

movdqu  xmm1, XMMWORD PTR _b$[esp+eax+132]
movdqu  xmm0, XMMWORD PTR _a$[esp+eax+132]
paddd   xmm1, xmm2
paddd   xmm1, xmm0
movdqu  XMMWORD PTR _a$[esp+eax+132], xmm1

In array notation: a[i:i+3] = b[i:i+3] + c;
Vector Lane Is Not a Thread!

Taking locks: a thread with thread_id x takes a lock; then another "thread" with the same thread_id enters the lock. Deadlock!

Exceptions: can we unwind 1/4th of the stack?
Vectorization: Not So Easy Any More

void f(int* a, int* b) {
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c;
        func();
    }
}

The compiler generates scalar code, one element per iteration:

mov   ecx, DWORD PTR _b$[esp+esi+140]
add   ecx, edi
add   DWORD PTR _a$[esp+esi+140], ecx
call  func

Aliasing? Side effects? Dependence? Exceptions?
Vectorization Hazard: Locks

Consider: f takes a lock, g releases the lock. The scalar loop takes and releases the lock once per element:

for (int i = 0; i < n; ++i) {
    lock.enter();
    a[i] = b[i] + c;
    lock.release();
}

A naive vectorization takes the lock four times before touching the data:

for (int i = 0; i < n; i += 4) {
    for (int j = 0; j < 4; ++j) lock.enter();
    a[i:i+3] = b[i:i+3] + c;
    for (int j = 0; j < 4; ++j) lock.release();
}

This transformation is not safe!
How Do We Get This?

void f(int* a, int* b) {
    for (int i = 0; i < n; ++i) {
        a[i] = b[i] + c;
        func();
    }
}

Desired: vectorize the arithmetic, keep the call scalar:

for (int i = 0; i < n; i += 4) {
    a[i:i+3] = b[i:i+3] + c;
    for (int j = 0; j < 4; ++j) func();
}

We need a helping hand from the programmer, because...
Vector Loop with Parallel STL

void f(int* a, int* b) {
    integer_iterator begin {0};
    integer_iterator end {n};
    std::for_each(std::vec, begin, end, [&](int i) {
        a[i] = b[i] + c;
        func();
    });
}
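integer_iterator is not a standard type; the intent is a counting iterator whose value_type is the index itself, so for_each over [0, n) hands the lambda plain ints. A minimal sketch of such an iterator, enough for a forward traversal (a production version, like boost::counting_iterator, would model a full random access iterator so the implementation can split and vectorize the range):

#include <cstddef>
#include <iterator>

// Minimal counting iterator: dereferencing yields the current index.
class integer_iterator {
    int i_;
public:
    using iterator_category = std::forward_iterator_tag;
    using value_type        = int;
    using difference_type   = std::ptrdiff_t;
    using pointer           = const int*;
    using reference         = int;

    explicit integer_iterator(int i) : i_(i) {}
    int operator*() const { return i_; }
    integer_iterator& operator++() { ++i_; return *this; }
    integer_iterator operator++(int) { auto tmp = *this; ++i_; return tmp; }
    bool operator==(const integer_iterator& other) const { return i_ == other.i_; }
    bool operator!=(const integer_iterator& other) const { return i_ != other.i_; }
};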
Parallelization vs. Vectorization

Parallelization: threads; each has a stack; good for divergent code; relatively heavy-weight.
Vectorization: vector lanes; no stack; lock-step execution; very light-weight.
When to Vectorize

std::par: no race conditions.
std::vec: same as std::par, plus: no aliasing, no exceptions, no locks, no/little divergence.
References

N3832: Task Region
N3872: A Primer on Scheduling Fork-Join Parallelism with Work Stealing
N3724: A Parallel Algorithms Library
N3850: Working Draft, Technical Specification for C++ Extensions for Parallelism
parallelstl.codeplex.com