
Early Results of Deep Learning on Stampede2 Supercomputer
"Exploring the impact of deep learning on high-performance computing systems such as Stampede2, with insights into methodology, performance, and future possibilities. Discover how scientists are leveraging deep learning for various research challenges in fields like astronomy, drug discovery, and more."
Presentation Transcript
Early Results of Deep Learning on the Stampede2 Supercomputer Zhao Zhang, Weijia Xu, Niall Gaffney, Daniel Stanzione Texas Advanced Computing Center zzhang@tacc.utexas.edu Sep 26th, 2017 1
Outline Overview Technical Approach Performance Summary and Future Work 2
Deep Learning as a Methodology Many scientists are exploring and adopting deep learning as a data science methodology to tackle their domain research challenges: astronomy, drug discovery, disease diagnosis, molecular dynamics, neurology, particle physics, and social science. 3
Deep Learning on HPC Neural networks are a good fit for HPC architectures, and HPC can reduce the training turnaround time. Minutes or hours: interactive research! Instant gratification! 1-4 days: tolerable; interactivity is replaced by running many experiments in parallel. 1-4 weeks: high-value experiments only; progress stalls. >1 month: don't even try. (Jonathan Hseu, Google Brain Team) 4
TACC's Mission To enable discoveries that advance science and society through the application of advanced computing technologies. TACC is embracing this new type of application and the programming paradigm changes it may bring. 5
Technical Approach Support a number of popular deep learning frameworks: Caffe, MXNet, and TensorFlow at the moment. Support multiple modern architectures: Intel KNL, Intel Xeon CPU, and Nvidia GPU. Profile and present the performance of a suite of well-known deep learning applications: ConvNet on the Cifar10 dataset; AlexNet, CaffeNet, and GoogLeNet on the ImageNet dataset. 6
Status We measured Caffe, MXNet, and TensorFlow performance on the Stampede2 supercomputer and on our GPU-based Maverick cluster. Intel Caffe is available as a module on the Stampede2 supercomputer, with a user guide in preparation. 7
Outline Overview Approach Performance Summary and Future Work 8
Stampede2 Phase 1 of Stampede2 has 4,200 KNL nodes, each with 96 GB DDR4 and 16 GB MCDRAM, a 100 Gb/s Intel Omni-Path interconnect, and a Lustre deployment with ~40 PB of storage. 9
Software Stack Intel Caffe (self_contained_MKLGOLD_u2), Machine Learning Scaling Library (MLSL, v2017-Preview), MKLML (2017.0.2.20170110), Intel MPI 17.0.4, MXNet (v0.10.0-79-g790328f), TensorFlow (v1.3.0-rc2) 10
Single-node Performance Caffe Goal: find the OMP_NUM_THREADS configuration that gives the best performance on a KNL, and compare the MKL2017 kernel with the default kernel. Methodology: fix the workloads (ConvNet with the Cifar10 dataset and CaffeNet with the ImageNet-100 dataset) and run them with {32, 64, 128} OpenMP threads. 11
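The slides do not show how the sweep is scripted; the following is a minimal sketch of one way to do it. The Caffe binary path, solver file, and KMP_AFFINITY setting are assumptions for illustration, not the authors' actual configuration.

```python
import os
import subprocess
import time

# Hypothetical paths; adjust to the actual Intel Caffe build and solver definition.
CAFFE_BIN = "./build/tools/caffe"
SOLVER = "examples/cifar10/convnet_solver.prototxt"

results = {}
for threads in (32, 64, 128):
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(threads)                  # OpenMP thread count under test
    env["KMP_AFFINITY"] = "granularity=fine,compact,1,0"   # assumed KNL thread-pinning choice

    start = time.time()
    subprocess.run([CAFFE_BIN, "train", "--solver=" + SOLVER], env=env, check=True)
    results[threads] = time.time() - start                 # time-to-solution in seconds

for threads, seconds in sorted(results.items()):
    print(f"{threads:4d} OpenMP threads: {seconds:9.1f} s")
```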
Single-Node Performance - Caffe MKL2017 has a 1.4-3.7x speedup compared to the default kernel, and 64 OpenMP threads deliver the best performance. [Chart: time-to-solution (sec) of ConvNet and CaffeNet with 32, 64, and 128 OpenMP threads] 12
Single-node Performance Architecture Comparison Goal: understand the performance difference between KNL and Nvidia's K40 and P100. Methodology: fix the workloads and run ConvNet, CaffeNet, AlexNet, and GoogLeNet on one KNL, one K40, and one P100. 13
Single-node Performance Architecture Comparison One KNL is 2x faster than one K40, and one KNL is 40-80% slower than one P100. [Chart: time-to-solution (sec) of ConvNet, CaffeNet, AlexNet, and GoogLeNet on one KNL, one K40, and one P100] 14
Single-node Performance MXNet and TensorFlow Slowdown summary of MXNet and TensorFlow on KNL: vs. a K40, 1.2-3.7x (MXNet) and 5.0-22.3x (TensorFlow); vs. a P100, 5.3-14.1x (MXNet) and 18.9-59.0x (TensorFlow). 15
Multi-node Performance - Strong Scaling Caffe Goal: understand the strong scaling pattern of Caffe. Methodology: fix the mini-batch size of each model, fix the number of epochs, and run ConvNet, CaffeNet, AlexNet, and GoogLeNet on {1, 2, 4, 8} KNLs. Fixed mini-batch size and fixed number of epochs: guaranteed generalization! 16
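A small sketch (not from the slides) of how strong-scaling efficiency is conventionally computed from such runs: with a fixed total workload, efficiency at N nodes is T(1) / (N * T(N)). The timings below are placeholders, not the measured Stampede2 results.

```python
def strong_scaling_efficiency(time_by_nodes):
    """time_by_nodes maps node count -> time-to-solution (s) for a fixed workload."""
    t1 = time_by_nodes[1]
    return {n: t1 / (n * t) for n, t in sorted(time_by_nodes.items())}

# Placeholder timings purely for illustration.
example = {1: 1200.0, 2: 680.0, 4: 430.0, 8: 300.0}
for nodes, eff in strong_scaling_efficiency(example).items():
    print(f"{nodes} KNL(s): {eff:.0%} strong scaling efficiency")
```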
Multi-Node Performance Strong Scaling Caffe Roughly 50% strong scaling efficiency. [Chart: time-to-solution (sec) of ConvNet, CaffeNet, AlexNet, and GoogLeNet on 1, 2, 4, and 8 KNLs] 17
Multi-Node Performance Strong Scaling Caffe 2 KNLs deliver similar performance to one P100 for CaffeNet, AlexNet, and GoogLeNet. [Chart: time-to-solution (sec) of Cifar10 ConvNet, CaffeNet, AlexNet, and GoogLeNet on 2 KNLs and on one P100] 18
Multi-node Performance Weak Scaling Caffe Goal: understand the weak scaling pattern of Caffe. Methodology: increase the mini-batch size proportionally to the node count, fix the number of epochs, and run ConvNet, CaffeNet, AlexNet, and GoogLeNet on {128, 256, 512, 1024} KNLs. Growing mini-batch size with a fixed number of epochs: guaranteed generalization? 19
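Not from the slides: a sketch of the weak-scaling bookkeeping, assuming a fixed per-node mini-batch size (the value below is a placeholder) so that the global mini-batch grows with node count, and using one common definition of weak-scaling efficiency based on per-node throughput relative to the smallest scale measured.

```python
PER_NODE_BATCH = 256  # assumed per-KNL mini-batch size; the actual value is not stated here

def global_batch(nodes):
    # Weak scaling: the global mini-batch size grows proportionally with node count.
    return PER_NODE_BATCH * nodes

def weak_scaling_efficiency(images_per_sec):
    """images_per_sec maps node count -> measured training throughput.

    Efficiency at N nodes = measured throughput / (baseline per-node throughput * N),
    where the baseline is taken at the smallest node count measured.
    """
    base_nodes = min(images_per_sec)
    per_node_baseline = images_per_sec[base_nodes] / base_nodes
    return {n: tput / (per_node_baseline * n) for n, tput in sorted(images_per_sec.items())}

# Placeholder throughputs purely for illustration, not the measured Stampede2 results.
example = {128: 20000.0, 256: 38000.0, 512: 72000.0, 1024: 130000.0}
for nodes, eff in weak_scaling_efficiency(example).items():
    print(f"{nodes:5d} KNLs, global mini-batch {global_batch(nodes):7d}: {eff:.0%}")
```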
Multi-Node Performance Weak Scaling Caffe Weak scaling efficiency of ConvNet, CaffeNet, AlexNet, and GoogLeNet on 128, 256, 512, and 1024 KNLs ranges from roughly 55% to 92%. MLSL stops working beyond its known largest working scale of 768 KNLs, so pure MPI mode is used at the largest scale. [Chart: weak scaling efficiency per model at 128, 256, 512, and 1024 KNLs] 20
Summary TACC has three popular deep learning frameworks, Caffe, MXNet, and TensorFlow, running on the Stampede2 supercomputer. With Caffe, a single KNL is ~2x faster than a K40 and ~40-80% slower than a P100. Strong scaling Caffe with the same mini-batch size has ~50% efficiency on 4 KNLs. Weak scaling Caffe with a large mini-batch size has ~80% efficiency on 512 KNLs. 21
Ongoing and Future Work Improving the Caffe scalability with Intel. Deploying the Layer-wise Adaptive Rate Scaling (LARS) algorithm to enable large mini-batch training in less time. Possibly optimizing MXNet and TensorFlow performance with Intel. 22
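The LARS algorithm mentioned above is not detailed in the slides; the following is a minimal NumPy sketch of its update rule, in which each layer's learning rate is scaled by the ratio of the weight norm to the gradient norm. Hyperparameter values are placeholders, not the settings used on Stampede2.

```python
import numpy as np

def lars_step(weights, grads, momentum_buf, base_lr=0.01, trust_coef=0.001,
              weight_decay=0.0005, momentum=0.9):
    """One LARS update over per-layer weight/gradient arrays.

    Each layer's step is scaled by trust_coef * ||w|| / (||g|| + weight_decay * ||w||),
    which keeps update magnitudes proportional to weight magnitudes and helps
    stabilize training with very large mini-batches.
    """
    new_weights = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        if w_norm > 0 and g_norm > 0:
            local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm)
        else:
            local_lr = 1.0  # fall back when a norm is zero (e.g., at initialization)
        momentum_buf[i] = momentum * momentum_buf[i] + base_lr * local_lr * (g + weight_decay * w)
        new_weights.append(w - momentum_buf[i])
    return new_weights, momentum_buf

# Example usage with two toy "layers".
weights = [np.random.randn(4, 4), np.random.randn(4)]
grads = [np.random.randn(4, 4), np.random.randn(4)]
buf = [np.zeros_like(w) for w in weights]
weights, buf = lars_step(weights, grads, buf)
```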
THANKS Q&A 23