
Advanced Machine Learning: Optimization and Parallel Computing
Explore the intersection of advanced machine learning, optimization techniques, and parallel computing in this in-depth course. Dive into GPU programming, CUDA, OpenCL, and techniques for enhancing deep learning results. Discover how parallel computing accelerates the processing of large datasets and complex algorithms, optimizing performance and efficiency for machine learning applications.
Presentation Transcript
CS 732: Advanced Machine Learning
Usman Roshan
Department of Computer Science, NJIT
Course material
- Parallel programming, with an emphasis on GPUs and the CUDA and OpenCL languages
- Optimization results
- Deep learning results
- Methods for learning new features

Grading
- Each student presents one paper: a 45-minute PowerPoint talk.
- Each student will do a project and present it toward the end of the semester. The project involves some programming and is an experimental performance study: either implement an algorithm in C or Python, or compare existing algorithms on different datasets.
Parallel computing
Why cover it in an advanced machine learning course? Some machine learning programs, for example large neural networks and kernel methods, take a long time to finish. Dataset sizes are also growing: while linear classification and regression programs are generally very fast, they can be slow on large datasets.
Examples
- Dot product evaluation (a minimal CUDA sketch follows this list)
- Gradient descent algorithms
- Cross-validation: evaluating many folds in parallel
- Parameter estimation
- http://www.nvidia.com/object/data-science-analytics-database.html
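As an illustration of the first example above, here is a minimal CUDA sketch of a parallel dot product: each thread multiplies one pair of elements and atomically adds the product to a single accumulator. The array size, block size, and kernel name are illustrative choices, not taken from the course material.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Each thread handles one index: multiply x[i]*y[i] and atomically
// accumulate the product into a single global sum.
__global__ void dot_kernel(const float *x, const float *y, float *result, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, x[i] * y[i]);
}

int main(void) {
    const int n = 1 << 20;                 // illustrative problem size
    size_t bytes = n * sizeof(float);

    float *hx = (float *)malloc(bytes), *hy = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy, *dresult, zero = 0.0f, result;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMalloc(&dresult, sizeof(float));
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dresult, &zero, sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;                     // threads per block
    int blocks = (n + threads - 1) / threads;
    dot_kernel<<<blocks, threads>>>(dx, dy, dresult, n);

    cudaMemcpy(&result, dresult, sizeof(float), cudaMemcpyDeviceToHost);
    printf("dot product = %f (expected %f)\n", result, 2.0f * n);

    cudaFree(dx); cudaFree(dy); cudaFree(dresult);
    free(hx); free(hy);
    return 0;
}

In practice the atomicAdd serializes the final accumulation; a per-block shared-memory reduction (sketched after the memory-types slide below) scales better.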
Parallel computing
- Multi-core programming
  - OpenMP: ideal for running the same program on different inputs
  - MPI: a master-slave setup that allows message passing
- Graphics Processing Units (GPUs)
  - Equipped with hundreds to thousands of cores
  - Designed for running hundreds of short functions, called threads, in parallel
GPU programming
GPU memory comes in four types with different sizes and access times:
- Global: largest, typically 3 to 6 GB, slow access time
- Local: same as global but specific to a thread
- Shared: on-chip, fastest, and limited to threads in a block
- Constant: cached global memory, accessible by all threads
Coalesced memory access is key to fast GPU programs: consecutive threads should access consecutive memory locations, as sketched below.
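A minimal sketch (not from the course material) of how these two ideas combine: consecutive threads read consecutive global elements, so the loads coalesce, and the partial sums are then combined in fast on-chip shared memory. The kernel name is hypothetical, and the code assumes a block size of exactly 256 threads.

// Block-level sum reduction: coalesced global reads plus a
// shared-memory reduction tree within each block.
__global__ void block_sum(const float *in, float *block_out, int n) {
    __shared__ float cache[256];           // shared memory: visible to one block
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Coalesced load: thread k of each warp reads element k.
    cache[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Halve the active threads each step, summing pairs in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        block_out[blockIdx.x] = cache[0];  // one partial sum per block
}

A second, tiny launch (or a host-side loop) then sums the per-block results.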
GPU programming
- Designed for running hundreds of short functions, called threads, in parallel
- Threads are organized into blocks, which are in turn organized into grids
- Ideal for running the same function on millions of different inputs (see the launch sketch below)
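A hedged sketch of that organization: one kernel is applied elementwise to a large array, and the grid is sized so that the blocks together cover all n inputs. The sigmoid is just an arbitrary stand-in for "the same function", and all names here are hypothetical.

// Every thread runs the same function on one input element.
__global__ void map_sigmoid(const float *in, float *out, int n) {
    // Global index: block offset plus position within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 1.0f / (1.0f + expf(-in[i]));
}

// Host-side launch: 256 threads per block, enough blocks to cover n.
//   int threads = 256;
//   int blocks = (n + threads - 1) / threads;
//   map_sigmoid<<<blocks, threads>>>(d_in, d_out, n);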
Languages
- CUDA: a C-like language introduced by NVIDIA; CUDA programs run only on NVIDIA GPUs
- OpenCL: host code is the same as C, and OpenCL programs run on all GPUs; it requires no special compiler, only the OpenCL header and object files (both easily available)
CUDA
We will compile and run a program for determining interacting SNPs in a genome-wide association study.
Location: http://www.cs.njit.edu/usman/Chi8