
Charm++ Workshop 2018 - Improvements, Features, and Offload API Discussion
Discover upcoming improvements, features, and discussions at the Charm++ Workshop 2018, including topics like one-sided RDMA, memory registration, command line options, and offload API management for accelerators. Dive into technical details and future considerations for optimizing performance and functionality in Charm++.

Presentation Transcript

Upcoming Improvements and Features in Charm++
Discussion Moderator: Eric Bohm
Charm++ Workshop 2018

One-sided, RDMA, Zero-copy
- Direct API integrated into Charm++; see Nitin Bhat's talk for details. Note: it uses IPC (i.e., CMA) for cross-process transfers within a node.
- Get vs. put:
  - Put semantics: when is it safe to write to remote memory? Message-layer completion notification is weak for put.
  - Get semantics: when is the remote data available? If your application already has that knowledge, get will have lower latency (see the sketch below).
- Memory registration: you can't access a page that isn't mapped. We have four strategies with different costs to choose from (next slide). We handle this for messages already, but if you want to zero-copy send from your own buffers, you have to think about this issue.
- Should the Direct API cover GPU-to-GPU operations?
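
As a rough illustration of the get path, here is a minimal sketch using the CkNcpyBuffer-based Direct API; the chare names, entry methods, and buffers are hypothetical, and exact signatures may differ across releases.

    // Sender: describe the source buffer and ship the descriptor over.
    // srcDoneCb fires when the runtime is finished with the source buffer.
    CkCallback srcDoneCb(CkIndex_Sender::srcDone(), thisProxy);
    CkNcpyBuffer src(srcBuf, numBytes, srcDoneCb, CK_BUFFER_REG);
    receiverProxy.recvBufferInfo(src);  // ordinary entry method carrying the descriptor

    // Receiver: wrap the destination buffer and issue a one-sided get.
    void Receiver::recvBufferInfo(CkNcpyBuffer src) {
      CkCallback destDoneCb(CkIndex_Receiver::dataArrived(), thisProxy);
      CkNcpyBuffer dest(destBuf, numBytes, destDoneCb, CK_BUFFER_REG);
      dest.get(src);  // RDMA get; dataArrived() runs once the data has landed
    }

A put simply inverts the direction (src.put(dest) on the side that owns the data); the completion-notification question above is what decides which one you want.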

Memory Registration
Four modes (a sketch of the first follows this list):
1. CK_BUFFER_PREREG: use CkRdmaAlloc / CkRdmaFree to allocate your buffers; you assert that all your buffers are accessible (i.e., pinned) to get the maximum performance benefit.
2. CK_BUFFER_UNREG: you expect the runtime system to handle registration as necessary. May incur registration or copy overhead.
3. CK_BUFFER_REG: request that the Direct API register your buffers. May incur per-transaction registration overhead.
4. CK_BUFFER_NOREG: no guarantee regarding pinning; RDMA not supported. Generic support for Ncpy operations; standard message protocols in use, with the associated copy overheads.
- Is this API sufficient?
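
A minimal sketch of the pre-registered mode, reusing the descriptor pattern from the previous slide; the count and callback names are illustrative.

    // Mode 1: allocate from the runtime's pre-registered (pinned) pool,
    // then assert CK_BUFFER_PREREG when building the Ncpy descriptor.
    double *buf = (double *)CkRdmaAlloc(count * sizeof(double));
    CkCallback doneCb(CkIndex_MyChare::sendDone(), thisProxy);
    CkNcpyBuffer src(buf, count * sizeof(double), doneCb, CK_BUFFER_PREREG);
    // ... use src in a get/put as on the previous slide ...
    CkRdmaFree(buf);  // return the buffer to the pinned pool once the transfer is done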

Command Line Options
- How wide should a run be if the user provides no arguments?
- The MPI default is 1 process with 1 core and 1 rank: a conservative choice, but is that really what the user intended?
- +autoProvision: everything we can see is ours.
- +processPerSocket with +wthPerCore is probably the right answer, unless you need to leave cores free for OS voodoo (+excludecore).
- Are there other common command line issues we should address?

Offload API
- Manage the offloading of work to accelerators.
- Support multiple accelerators per host and per process.
- Completion converted to a Charm++ callback event (see the sketch below).
- Allow work to be done on the GPU or the CPU, based on utilization and suitability.
- CUDA only: that is where the platforms have been and are going.
- Are there other aspects of accelerator interaction that we should prioritize? How much priority should we place on other accelerator APIs?
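
As a sketch of the completion-to-callback pattern, assuming the GPU Manager's hapiAddCallback interface (the kernel, stream, and chare names are hypothetical):

    // Launch CUDA work asynchronously, then let the runtime convert stream
    // completion into a Charm++ callback instead of blocking the PE.
    void MyChare::offload() {
      myKernel<<<gridDim, blockDim, 0, stream>>>(devData);
      CkCallback *cb = new CkCallback(CkIndex_MyChare::kernelDone(), thisProxy);
      hapiAddCallback(stream, cb);  // kernelDone() is invoked once the stream drains
    }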

C++ Integration
- std::vector directly supported in reductions.
- [inline] supports templated methods.
- R-value references supported.
- PUP supports enums, std::deque, and std::forward_list (see the sketch below).
- CkLoop supports lambda syntax.
- Which advanced C++ features should be prioritized?
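
A minimal sketch of the newly supported PUP cases; the type names are illustrative.

    #include "pup_stl.h"  // STL container support for PUP

    enum Phase { INIT, RUN, DONE };

    struct State {
      Phase phase;                 // enums now pup directly
      std::deque<double> history;  // deque (and forward_list) now supported
      void pup(PUP::er &p) {
        p | phase;
        p | history;
      }
    };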

CharmPy
- Basic Charm++ support in Python.
- Limited load balancer selection.
- No nodegroup.
- No SMP mode (Python global interpreter lock).
- Which parts of the Python ecosystem should we prioritize for compatibility? Are there use cases for CharmPy that sound interesting to you?

Within-Node Parallelism
- Support for Boost threads (uFcontext): the default choice for platforms where they don't break other features (not OS X); lowest context-switching cost.
- Integration of the LLVM OpenMP implementation: supports clean interoperability between Charm++ ULTs (CkLoop, [threaded], etc.) and OpenMP, to avoid oversubscription and resource contention (see the sketch below).
- Finer controls for CkLoop work-stealing strategies.
- Our support for the OpenMP task API is weak. How important is that to you?
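
As an illustration of the intended interoperability, a sketch of an entry method containing an OpenMP loop, assuming a Charm++ build with the integrated LLVM OpenMP runtime; the chare, buffers, and reduction target are hypothetical.

    // With the integrated runtime, these iterations become tasks that idle
    // PEs within the node can help execute, rather than OpenMP spawning a
    // competing thread pool that oversubscribes the cores.
    void Solver::compute() {
      #pragma omp parallel for
      for (int i = 0; i < n; i++) {
        out[i] = stencil(in, i);
      }
      contribute(CkCallback(CkReductionTarget(Main, done), mainProxy));
    }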

AMPI
- Improved point-to-point communication latency and bandwidth, particularly for messages within a process.
- Updated AMPI_Migrate() with built-in MPI_Info objects, such as AMPI_INFO_LB_SYNC (see the sketch below).
- Fixes to MPI_Sendrecv_replace, MPI_(I)Alltoall{v,w}, MPI_(I)Scatter(v), MPI_IN_PLACE in gather collectives, MPI_Type_free, MPI_Op_free, and MPI_Comm_free.
- Implemented MPI_Comm_create_group and MPI_Dist_graph support.
- Added support for using -tlsglobals for privatization of global/static variables in shared objects; previously -tlsglobals required static linking.
- AMPI only renames the user's MPI calls from MPI_* to AMPI_* if Charm++/AMPI is built on top of another MPI implementation for communication.
- Support for compiling mpif.h in both fixed form and free form.
- PMPI profiling interface support added.
- Added an ampirun script that wraps charmrun, to ease integration with build and test scripts that already take mpirun/mpiexec as an option.
- Which incomplete aspects of MPI-3 are of highest importance to you?
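
A sketch of the updated migration call inside an iterative solver; the loop structure, period, and helper functions are illustrative.

    // All ranks call AMPI_Migrate collectively; AMPI_INFO_LB_SYNC is one of
    // the built-in MPI_Info objects and requests synchronous load balancing.
    for (int iter = 0; iter < maxIter; iter++) {
      compute_step();
      exchange_halos();
      if (iter % lbPeriod == 0)
        AMPI_Migrate(AMPI_INFO_LB_SYNC);
    }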

Charm++ Next Generation (post-6.9)
- How attached are you to the current load balancing API?
- Should per-PE load balancing be the focus, or per-host?
- Should chares be bound to a PE by default?
- Should entry methods be non-reentrant by default? Should unbound chares be non-reentrant by default?
- How much of a burden is charmxi to your application development?
- A dedicated scheduler 1:1 with each execution stream and hardware thread, vs. a selectable number of schedulers bound to execution streams with the remainder as drones executing work-stealing queues?
- Should we implement multiple comm threads per process, or no dedicated comm threads (a la the PAMI layer)?