Lightweight Thread Approaches for High Performance Computing
This review discusses the use of lightweight thread libraries to maximize on-node parallelism for high-performance computing. It compares lightweight thread approaches with standard solutions like Pthreads, highlighting their benefits and drawbacks. Various lightweight thread libraries and high-level programming models are examined for their effectiveness in exploiting fine-grained task parallelism.
Presentation Transcript
A Review of Lightweight Thread Approaches for High Performance Computing. Adrián Castelló, Rafael Mayo, Enrique S. Quintana-Ortí, Sangmin Seo, Pavan Balaji, Antonio J. Peña. Universitat Jaume I de Castelló (Spain), Barcelona Supercomputing Center (Spain), Argonne National Laboratory (USA). IEEE Cluster 2016, 13th-15th September, Taipei (Taiwan).
Motivation
Exascale systems will offer massive hardware concurrency, so exploiting on-node parallelism is inevitable.
(Figure: distribution of cores per socket on the Top500 list, as a percentage per year; categories 1, 2, 4, 6, 8, 9-14, 16, 18-260 cores.)
Current Solution
Pthreads is the standard way to exploit current on-node parallelism, either by using the Pthreads library directly or through high-level programming models built on top of it.
- Pros: it works well for the hardware characteristics.
- Cons: it falls short from the point of view of software requirements; context switches and synchronization of OS threads are expensive mechanisms.
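To ground the comparison, below is a minimal sketch of the Pthreads baseline: every unit of work costs an OS-level thread whose creation, scheduling, and joining involve the kernel. The worker function and thread count are illustrative, not taken from the talk.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4   /* illustrative thread count */

    /* Each unit of work is an OS thread: creation, context switches and
       joins all involve the kernel, which is the cost LWT libraries avoid. */
    static void *worker(void *arg)
    {
        printf("Hello from thread %ld\n", (long)arg);
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }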
Lightweight Thread Libraries
(Diagram: several user-level threads (ULTs) multiplexed on top of one OS thread.)
- Lightweight threads have low context-switch overhead.
- They better overlap computation with communication/IO.
- They exploit fine-grained task parallelism.
Lightweight Thread Libraries: Classification
- High-level programming model: ConverseThreads, Nanos++
- OS-specific: Windows Fibers, Solaris Threads
- Lightweight thread abstraction: Cilk, Go, Intel TBB
- Hardware-specific: Tiny-Threads
- General purpose: Qthreads, MassiveThreads, Argobots
- Stackless threads: Stackless Python, Protothreads
Lightweight Thread Libraries: Feature Comparison

| Concept              | Pthreads | Argobots | Qthreads | MassiveThreads | ConverseThreads | Go  |
|----------------------|----------|----------|----------|----------------|-----------------|-----|
| Levels of hierarchy  | 1        | 2        | 3        | 2              | 2               | 2   |
| # of work-unit types | 1        | 2        | 1        | 1              | 2               | 1   |
| Group control        | -        | Yes      | Yes      | Yes            | Yes             | Yes |
| Global queue         | Yes*     | Yes      | Yes      | -              | -               | Yes |
| Private queue        | Yes*     | Yes      | Yes      | Yes            | Yes             | -   |
| Plug-in scheduler    | Yes      | Yes      | Yes      | Yes            | Yes             | -   |
| Stackable scheduler  | -        | Yes      | -        | -              | -               | -   |
| Group scheduler      | -        | Yes      | -        | -              | -               | -   |

* Implemented by the programmer.

The LWT libraries offer more flexibility than Pthreads.
Why Are These Libraries Not Used?

    #pragma omp parallel
    Some_Code();

- OpenMP is implemented on top of Pthreads.
- It is a directive-based programming model: parallel code with just one additional line of code.
Our Target
- Libraries evaluated: Go, Argobots, Qthreads, ConverseThreads, and MassiveThreads.
- Hardware: 36-core Intel Xeon E5-2699 v3, 128 GB of memory.
- Software: lightweight thread libraries as of May 2016; gcc 5.2 and Intel icc 15.0.1 (Intel OpenMP runtime 20151009).
- Code: sscal BLAS-1 operation.
- Method: first, analyze the OpenMP behavior; second, mimic the OpenMP mechanisms with the lightweight thread libraries; third, compare the achieved performance against OpenMP as the baseline.
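For reference, sscal is the level-1 BLAS operation x := alpha * x. The slides do not show the benchmark code itself; a minimal OpenMP version of such a kernel, with illustrative names, could look like this:

    #include <stddef.h>

    /* sscal (BLAS-1): x := alpha * x. Illustrative kernel only; the exact
       benchmark code used in the study is not reproduced in the slides. */
    void sscal(size_t n, float alpha, float *x)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            x[i] = alpha * x[i];
    }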
LWT Programming Model

    #define N 100

    void example() {
        printf("Hello\n");
    }

    int main() {
        initialization();                      /* 1: environment initialization */

        for (int i = 0; i < N; i++)
            ULT_creation_to(example, dest);    /* 2: ULT/tasklet creation */

        yield();                               /* 3: context switch */

        for (int i = 0; i < N; i++)
            join();                            /* 4: ULT/tasklet join */

        finalize();                            /* 5: environment finalization */
    }

(Generic pseudocode: initialization, ULT_creation_to, yield, join, and finalize stand for the corresponding calls in each library; dest is the destination where the ULT is created.)
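The generic pseudocode above maps onto the real library calls listed later in the "Conclusions II" table. As an illustration, a minimal Argobots sketch of the same steps might look like the following; it assumes the Argobots API of roughly that period (ABT_init, ABT_thread_create, ABT_thread_yield, ABT_thread_free, ABT_finalize) and pushes all ULTs to the main pool of the calling execution stream. Exact signatures may differ across library versions.

    #include <abt.h>
    #include <stdio.h>

    #define N 100

    static void example(void *arg)
    {
        (void)arg;
        printf("Hello\n");
    }

    int main(int argc, char *argv[])
    {
        ABT_xstream xstream;
        ABT_pool pool;
        ABT_thread ults[N];

        ABT_init(argc, argv);                          /* 1: environment initialization */
        ABT_xstream_self(&xstream);
        ABT_xstream_get_main_pools(xstream, 1, &pool);

        for (int i = 0; i < N; i++)                    /* 2: ULT creation into a pool */
            ABT_thread_create(pool, example, NULL,
                              ABT_THREAD_ATTR_NULL, &ults[i]);

        ABT_thread_yield();                            /* 3: context switch to a ready ULT */

        for (int i = 0; i < N; i++)                    /* 4: join (and free) each ULT */
            ABT_thread_free(&ults[i]);

        ABT_finalize();                                /* 5: environment finalization */
        return 0;
    }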
LWT Programming Model II: ULT/Tasklet Creation (step 2)

| Concept          | Pthreads    | Argobots           | Qthreads    | MassiveThreads | ConverseThreads    | Go           |
|------------------|-------------|--------------------|-------------|----------------|--------------------|--------------|
| Result           | New Pthread | New ULT or tasklet | New ULT     | New ULT        | New ULT or tasklet | New ULT      |
| Executed by      | OS          | Execution stream   | Worker      | Worker         | Thread             | Process      |
| Oversubscription | Maybe       | No                 | No          | No             | No                 | No           |
| ULT destination  | Yes*        | Yes                | Yes         | No             | Just tasklets      | No           |
| Work queues      | OS          | Private (ES)       | Private (W) | Private        | Private            | Shared/Own   |
| Queue access     | OS          | (Almost) free      | Mutex       | Mutex          | (Almost) free      | Mutex        |
| Load balance     | Application | Application        | Application | Work-stealing  | Application        | Work-sharing |
| Main drawback    | OS actions  | Dispatch step      | Contention  | Contention     | Dispatch step      | Contention   |

* Set by the programmer.
Basic Functionality
- Creation step: measures the dispatch overhead. One ULT/tasklet is created per thread in the LWT libraries, while in OpenMP only the function-pointer initialization is measured.
- Joining step: one ULT/tasklet is joined per thread in the LWT libraries, while in OpenMP the join function is measured. The join mechanisms compared are barrier vs. memory status vs. work-unit status.
OpenMP Microbenchmarks I: For Loop

    #pragma omp parallel for
    for (i = 0; i < 1000; i++)
        code(i);

- In the LWT versions, one ULT is created for each thread and the iterations are divided between the ULTs (see the sketch below).
- The results are similar to the create + join figures.
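As a concrete illustration of "one ULT per thread, iterations divided between the ULTs", here is a hedged Argobots sketch that statically chunks the 1000 iterations. The chunk struct, worker function, and ULT count are assumptions made for the example, not the authors' code.

    #include <abt.h>

    #define ITERS  1000
    #define N_ULTS 4        /* illustrative: one ULT per thread/execution stream */

    extern void code(int i);   /* the loop body from the OpenMP example */

    typedef struct { int begin, end; } chunk_t;

    /* Each ULT runs a contiguous chunk of the iteration space, mimicking
       the static schedule of "#pragma omp parallel for". */
    static void chunk_worker(void *arg)
    {
        chunk_t *c = (chunk_t *)arg;
        for (int i = c->begin; i < c->end; i++)
            code(i);
    }

    static void parallel_for_ults(ABT_pool pool)
    {
        ABT_thread ults[N_ULTS];
        chunk_t chunks[N_ULTS];
        int per_ult = ITERS / N_ULTS;

        for (int u = 0; u < N_ULTS; u++) {
            chunks[u].begin = u * per_ult;
            chunks[u].end   = (u == N_ULTS - 1) ? ITERS : (u + 1) * per_ult;
            ABT_thread_create(pool, chunk_worker, &chunks[u],
                              ABT_THREAD_ATTR_NULL, &ults[u]);
        }
        for (int u = 0; u < N_ULTS; u++)
            ABT_thread_free(&ults[u]);   /* join each chunk ULT */
    }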
OpenMP Microbenchmarks I: Nested For Loop

    #pragma omp parallel for
    for (i = 0; i < 1000; i++) {
        #pragma omp parallel for firstprivate(i)
        for (j = 0; j < 1000; j++)
            code(i, j);
    }

- (Results figure: y-axis values in seconds.)
- Converse Threads needs extra scheduler calls.
- OpenMP generates oversubscription.
Nested Parallelism

    omp_set_num_threads(4);
    #pragma omp parallel for
    for (i = 0; i < 1000; i++) {
        #pragma omp parallel for firstprivate(i)
        for (j = 0; j < 1000; j++) {
            code(i, j);
        }
    }

| Step | GCC OpenMP | ICC OpenMP | LWT |
|------|------------|------------|-----|
| 1 | Creates 4 outer-loop threads | Creates 4 outer-loop threads | Creates 4 outer-loop ULTs |
| 2 | Creates 3 inner-loop threads for each outer-loop thread | Checks for idle threads and creates 3 new inner-loop threads only if needed | Creates 4 inner-loop ULTs for each outer-loop ULT |
| 3 | Puts the inner-loop threads into the idle thread pool | Puts the inner-loop threads into the idle thread pool | Joins the 4 inner-loop ULTs |

(Diagram: with 4 outer-loop threads, GCC ends up with 16 OS threads.)
OpenMP Microbenchmarks II
Tasks in a single region:

    #pragma omp parallel
    {
        #pragma omp single
        for (i = 0; i < 1000; i++) {
            #pragma omp task
            code(i);
        }
    }

Tasks in a parallel (for) region:

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < 1000; i++) {
            #pragma omp task
            code(i);
        }
    }
OpenMP Microbenchmarks II: Analysis
- In both variants (tasks in a single region and tasks in a parallel region), each OpenMP task is converted into a ULT or a tasklet in the LWT versions.
- Tasklets perform better because they avoid the dispatch effect.
- Work-sharing vs. work-stealing: GCC implements only one shared task queue, whereas ICC employs one task queue per thread and uses work-stealing.
Conclusions
- Lightweight thread solutions can mimic common parallel code patterns.
- They achieve performance that is at least as good as the OpenMP runtimes.
- General-purpose libraries perform better.
- Moreover, some implementation choices with a strong performance impact have been identified in the OpenMP runtime systems.
Conclusions II
We found that these parallel codes can be implemented with a reduced set of LWT functions:

| Function       | Argobots          | Qthreads           | MassiveThreads | ConverseThreads | Go          |
|----------------|-------------------|--------------------|----------------|-----------------|-------------|
| Initialization | ABT_init          | qthread_initialize | myth_init      | ConverseInit    | -           |
| ULT creation   | ABT_thread_create | qthread_fork       | myth_create    | CthCreate       | go function |
| Yield          | ABT_thread_yield  | qthread_yield      | myth_yield     | CthYield        | -           |
| Join           | ABT_thread_free   | qthread_readFF     | myth_join      | -               | channel     |
| Finalization   | ABT_finalize      | qthread_finalize   | myth_fini      | ConverseExit    | -           |
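To show how the table's entries compose, below is a hedged Qthreads sketch of the basic create/join pattern using only the calls named above (qthread_initialize, qthread_fork, qthread_readFF, qthread_finalize). The header path and the aligned_t return convention are assumed from the Qthreads API and may differ across versions.

    #include <qthread/qthread.h>
    #include <stdio.h>

    #define N 100

    /* A Qthreads ULT returns through an aligned_t value that the parent
       waits on with qthread_readFF (the "Join" row of the table). */
    static aligned_t example(void *arg)
    {
        (void)arg;
        printf("Hello\n");
        return 0;
    }

    int main(void)
    {
        aligned_t rets[N], tmp;

        qthread_initialize();                       /* Initialization */
        for (int i = 0; i < N; i++)
            qthread_fork(example, NULL, &rets[i]);  /* ULT creation */
        for (int i = 0; i < N; i++)
            qthread_readFF(&tmp, &rets[i]);         /* Join: wait for each result */
        qthread_finalize();                         /* Finalization */
        return 0;
    }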
Current Work: Generic Lightweight Thread (GLT) Library
- A common API for LWT solutions, currently layered over Qthreads, MassiveThreads, and Argobots.
- Two parts: CORE for common features and EXTENDED for solution-specific functions.
- Two approaches: stand-alone library or headers; scheduling relies on the underlying library.
- Two types of work-unit support: tasklets and ULTs.
- www.hpca.uji.es/GLT - github.com/adcastel/GLT.git
Future Work
Reimplement some Pthreads-based high-level programming models (OpenMP, OmpSs, etc.) on top of that API.
Thank you! Adrián Castelló (adcastel@uji.es). IEEE Cluster 2016.
OpenMP Microbenchmarks III: Nested Tasks

    void code(int i) {
        #pragma omp task
        test1();
        #pragma omp task
        test2();
        #pragma omp taskwait
    }

    #pragma omp parallel
    {
        #pragma omp single
        for (i = 0; i < 200; i++) {
            #pragma omp task
            code(i);
        }
    }