
DCompute: Native & Convenient Heterogeneous Computing for D
Outline
- Introduction
- Compiler
- Libraries
- Using DCompute (present and future)
- Future directions
State of Hardware
- X86: all compilers
- ARM: GDC, LDC
- MIPS, PPC: LDC
- DSPs, FPGAs, GPUs: ?
State of Hardware
- DSPs: C, OpenCL
- FPGAs: HDLs (Verilog, VHDL, DHDL?), OpenCL
- GPUs: shaders (GLSL, HLSL) for graphics; CUDA, OpenCL for compute
CUDA
- NVidia only
- CUDA C++/Fortran => PTX => SASS
- Well integrated
- Reasonably nice to use (from C/C++)
OpenCL
- Many targets: GPUs, DSPs, FPGAs
- SPIR-V
- OpenCL C/C++ (kernel languages) are OK
- API is horrible
So...
- CUDA has vendor lock-in
- OpenCL isn't very nice
- But LLVM targets SPIR-V and PTX
- And we have a D compiler that targets LLVM (LDC)
Enter DCompute
- Compiler: enables writing kernels in D
- Library: automates using those kernels from D
- Kernels: prewritten kernels for heterogeneous acceleration with less effort
Compiler
- Core functionality done
- Work in progress:
  - Images / pipes
  - Globals
  - Better errors / error reporting
  - SPIR-V optimisations (not very critical)
Targeting CUDA & OpenCL
Tacking on:
- Bits of metadata to the modules
- Calling convention
- Address space pointers
- Images & other special types (WiP)
- Indexing (get_local_id / threadIdx)
- Other intrinsics
(a kernel sketch follows below)
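To make the address space pointers and indexing concrete, here is a minimal kernel sketch in D. The import locations and the GlobalIndex helper are assumptions for illustration, not necessarily the exact DCompute API:

    // Device-only module: compiled to SPIR-V and PTX, not for the host.
    @compute(CompileFor.deviceOnly) module my_kernels;

    import ldc.dcompute;        // @kernel, GlobalPointer, ... (assumed location)
    import dcompute.std.index;  // indexing helpers (assumed location)

    // Each GlobalPointer!float lowers to a { float addrspace(n)* };
    // GlobalIndex.x lowers to get_global_id(0) on OpenCL and to the
    // blockIdx/blockDim/threadIdx computation on CUDA.
    @kernel void saxpy(GlobalPointer!float res,
                       GlobalPointer!float x,
                       GlobalPointer!float y,
                       float alpha, size_t n)
    {
        auto i = GlobalIndex.x;
        if (i >= n) return;
        res[i] = alpha * x[i] + y[i];
    }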
Compilation Process Models
- OpenCL-like: separate compilation; no compile-time info
- SYCL-like: hybrid source code; the kernel is one long, highly nested lambda; hides a lot of runtime magic
- CUDA-like: hybrid source; runtime magic; semantic validation is harder
- Want something that fits well with modules & retains compile-time information
Compilation Process

    ldc2 -mdcompute-targets=ocl-220,cuda-620 files.d

    module normal;                      // no attribute: host-only code

    @compute(CompileFor.hostAndDevice)
    module shared_code;

    @compute(CompileFor.deviceOnly)
    module my_kernels;

    @kernel void foo(GlobalPointer!float f) { }

    if (__dcompute_reflect(target, version)) { }

- GlobalPointer!T -> { T addrspace(n)* }
- ABI
- Metadata
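A sketch of how __dcompute_reflect enables per-target code paths (ReflectTarget and its members are assumed names for illustration); the branch selection happens during codegen, as the visitor on the next slide shows:

    // Only the branch matching the current codegen target survives into
    // the emitted device code; the other is dropped entirely.
    void crossTarget()
    {
        if (__dcompute_reflect(ReflectTarget.OpenCL, 0))
        {
            // OpenCL-specific path, e.g. an OpenCL builtin wrapper
        }
        else if (__dcompute_reflect(ReflectTarget.CUDA, 0))
        {
            // CUDA-specific path, e.g. a __syncthreads-style wrapper
        }
    }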
Codegen: conditional compilation

    // In LDC's statement visitor: recognise `if (__dcompute_reflect(...))`
    // and emit only the branch matching the current codegen target.
    if (stmt->condition->op == TOKcall) {
      auto ce = (CallExp *)stmt->condition;
      if (ce->f && ce->f->ident &&
          !strcmp(ce->f->ident->toChars(), "__dcompute_reflect")) {
        if (match(ce->arguments))
          stmt->ifbody->accept(this);
        else if (stmt->elsebody)
          stmt->elsebody->accept(this);
        return;
      }
    }
Benefits
- Done in one compilation (host, CUDA & OpenCL)
- No need to worry about templates
- Get compile-time info on kernels
DCompute
- Standard library of compute operations, for use with kernels
- Driver: abstraction over the OpenCL and CUDA runtimes; handles device and host interactions, launching kernels, and managing memory
- Standard collection of kernels
Standard Library
- Indexation
- Synchronisation primitives
- Vectors (SIMD & geometric)
- Math (the usual)
- Images (1d, 2d, 3d + arrays, cubes)
- Packing (colour operations)
- Atomics
- Work-group operations (reduce, ...)
Backed by:
- CUDA: libdevice + LLVM PTX intrinsics
- OpenCL: intrinsic operations (magic)
(a usage sketch follows below)
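As a flavour of how these pieces combine inside a kernel, here is a hedged sketch; atomicAdd and GlobalIndex are assumed wrapper names, not confirmed DCompute API:

    // Histogram via indexation + atomics: on CUDA the atomic maps to a
    // PTX intrinsic, on OpenCL to the corresponding builtin.
    @kernel void histogram(GlobalPointer!uint bins,
                           GlobalPointer!ubyte data, size_t n)
    {
        auto i = GlobalIndex.x;
        if (i < n)
            atomicAdd(&bins[data[i]], 1u); // assumed atomic wrapper name
    }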
Driver (WiP)
- Allocate & manage device memory
- Data transfer
- Kernels: load, launch
- Device synchronisation
- Events
Driver API Automation
For launching kernels we want something like:

    @kernel void my_kernel(T)(GlobalPointer!T p, int args) { ... }

    void main(string[] args)
    {
        auto dev = getDefaultDevice(getConfig());
        auto q   = dev.getDefaultQueue();

        float[] arr = someData();
        Buffer!float b = dev.makeBuffer(arr);

        Event v = q.enqueue!(my_kernel!float)(b.length)(b, 42);
        v.wait();
        b.read(arr).writeln;
    }
    struct Queue
    {
        auto enqueue(alias kernel)(LaunchParams lp)
        {
            // Bind the kernel symbol and its mangled name into the Call,
            // so the launch is fully typed at compile time.
            return Call!(typeof(kernel), kernel.mangleof)(lp, this);
        }
    }

    struct Call(F, string mangle)
    {
        LaunchParams lp;
        Queue q;
        Event opCall(KernelArgsOf!F args)
        {
            // Get type correctness for free!
            // Use Parameters!F to call clSetKernelArg /
            // clEnqueueNDRangeKernel or cuLaunchKernel.
        }
    }
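For illustration, KernelArgsOf!F could be a small trait that maps each device-side parameter type to its host-side counterpart; this helper (and the stub types) is an assumption, not DCompute's actual implementation:

    import std.meta : staticMap;
    import std.traits : Parameters;

    struct GlobalPointer(T) {} // stub standing in for the DCompute type
    struct Buffer(T) {}        // stub standing in for the driver's buffer type

    // A GlobalPointer!T parameter is supplied from the host as a
    // Buffer!T; every other parameter type is passed through unchanged.
    template HostArg(T)
    {
        static if (is(T : GlobalPointer!U, U))
            alias HostArg = Buffer!U;
        else
            alias HostArg = T;
    }

    alias KernelArgsOf(alias F) = staticMap!(HostArg, Parameters!F);

Because the launch argument list is computed from the kernel's own signature, passing a wrongly typed buffer fails at compile time rather than inside clSetKernelArg or cuLaunchKernel.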
Collection of Kernels
- Showcase of how to do things (and how not to)
- Functional examples covering common use cases: convolutions (DFT/FFT), reductions, filter, sort
- Thrust-like (except ranges) (see the sketch below)
- Precompilation (for use with DMD/GDC)
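As a flavour of the Thrust-like pieces, a generic map kernel might look like the following sketch (assumed names, not an actual DCompute kernel); the operation is an alias parameter, so each instantiation is specialised and type-checked at compile time:

    @kernel void map(alias f, T)(GlobalPointer!T buf, size_t n)
    {
        auto i = GlobalIndex.x;
        if (i < n)
            buf[i] = f(buf[i]);
    }

    // Hypothetical host-side use:
    //     q.enqueue!(map!(x => 2 * x, float))(n)(b, n);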
Future Work
- Make SPIR-V use intrinsics (LLVM, WiP)
- Compiler:
  - Tests (WiP)
  - Relax DCompute constraints
  - Images & pipes (integrate with Phobos #2845)
  - Globals
- Std lib: add missing functions -> intrinsics
- Driver:
  - Finish integrating clWrap, CUDA
  - High-level API over OpenCL + CUDA
  - Testing framework
- Library integration with e.g. mir (ndslice, GLAS, CV), scid
- Kernels: standard algorithms, NEW algorithms
Conclusion
- DCompute is a compiler extension of LDC to target OpenCL and CUDA (and the host) all at the same time! (working, but not feature complete)
- Runtime libraries are a work in progress.
- Building kernels works; demo to follow.
- This is D, so we are able to make this nice to use (in spite of the horribleness of the underlying APIs) thanks to awesome templates and introspection!
- World domination of HPC will be within grasp!
Acknowledgments
- John Colvin
- David Nadlinger
- Kai Nacke
- Kinke
- Johan Engelen
Questions?
- LLVM: https://github.com/thewilsonator/llvm/tree/compute
- LDC: https://github.com/ldc-developers/ldc/tree/dcompute (master soonTM)
- DCompute: https://github.com/libmir/dcompute