Application and Software Classifications for Big Data Convergence


Combine NAS Parallel Benchmarks, Berkeley Dwarfs, and NIST Big Data with High Performance Computing to create a unified framework linking Big Data and Simulation. Explore use cases ranging from Government Operations to Energy in a comprehensive classification system. Discover detailed applications in fields like Healthcare, Deep Learning, and Earth Science, showcasing the potential of Big Data convergence.

  • Big Data
  • High Performance Computing
  • Application Classifications
  • Big Simulation
  • Convergence




Presentation Transcript


  1. Application and Software Classifications that motivate Big Data and Big Simulation Convergence. HPC 2016: High Performance Computing from Clouds and Big Data to Exascale and Beyond, June 27 - July 1, 2016, Cetraro. http://www.hpcc.unical.it/hpc2016/ Geoffrey Fox, June 28, 2016, gcf@indiana.edu, http://www.dsc.soic.indiana.edu/, http://spidal.org/ Department of Intelligent Systems Engineering, School of Informatics and Computing, Digital Science Center, Indiana University Bloomington. http://hpc-abds.org/kaleidoscope/

  2. Abstract. We combine the NAS Parallel Benchmarks, the Berkeley Dwarfs, the Computational Giants of the NRC Massive Data Analysis report, and the NIST Big Data use cases into an application classification, the Convergence Diamonds, that links Big Data and Big Simulation in a unified framework. We combine this with the High Performance Computing enhanced Apache Big Data software stack (HPC-ABDS) and suggest a simple approach to computing systems that supports data management, analytics, visualization, and simulation without sacrificing performance. We describe a set of "software defined" application exemplars using the Ansible-based DevOps tool Cloudmesh.

  3. NIST Big Data Initiative: Use Cases and Properties. Led by Chaitan Baru, Bob Marcus, and Wo Chang.

  4. 51 Detailed Use Cases, contributed July-September 2013. Each covers goals, data features (such as the 3 V's), software, and hardware.
  • Government Operation (4): National Archives and Records Administration, Census Bureau
  • Commercial (8): Finance in Cloud, Cloud Backup, Mendeley (citations), Netflix, Web Search, Digital Materials, Cargo Shipping (as in UPS)
  • Defense (3): Sensors, Image Surveillance, Situation Assessment
  • Healthcare and Life Sciences (10): Medical Records, Graph and Probabilistic Analysis, Pathology, Bioimaging, Genomics, Epidemiology, People Activity Models, Biodiversity
  • Deep Learning and Social Media (6): Driving Car, Geolocating Images/Cameras, Twitter, Crowd Sourcing, Network Science, NIST Benchmark Datasets
  • The Ecosystem for Research (4): Metadata, Collaboration, Language Translation, Light Source Experiments
  • Astronomy and Physics (5): Sky Surveys (including comparison to simulation), Large Hadron Collider at CERN, Belle II Accelerator in Japan
  • Earth, Environmental and Polar Science (10): Radar Scattering in Atmosphere, Earthquake, Ocean, Earth Observation, Ice Sheet Radar Scattering, Earth Radar Mapping, Climate Simulation Datasets, Atmospheric Turbulence Identification, Subsurface Biogeochemistry (microbes to watersheds), AmeriFlux and FLUXNET Gas Sensors
  • Energy (1): Smart Grid
Published by NIST as http://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-3.pdf; Version 2 is being prepared. 26 features are recorded for each use case; the collection is biased toward science.

  5. Features of 51 Use Cases I
  • PP (26): All Pleasingly Parallel or Map Only
  • MR (18): Classic MapReduce (add MRStat below for the full count)
  • MRStat (7): Simple version of MR where the key computations are simple reductions, as found in statistical summaries such as histograms and averages
  • MRIter (23): Iterative MapReduce or MPI (Flink, Spark, Twister)
  • Graph (9): Complex graph data structure needed in analysis
  • Fusion (11): Integrate diverse data to aid discovery/decision making; could involve sophisticated algorithms or could just be a portal
  • Streaming (41): Some data comes in incrementally and is processed this way
  • Classify (30): Classification: divide data into categories
  • S/Q (12): Index, Search and Query
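As a minimal illustration of the MRStat pattern, the sketch below reduces each data partition to a partial histogram and merges the partials with a simple reduction; plain Python stands in for a real MapReduce runtime, and the bin width and sample data are illustrative:

```python
from collections import Counter
from functools import reduce

# Map step: each parallel task reduces its slice of the data to a small
# statistic (here, a histogram keyed by bin lower edge).
def map_partition(values, bin_width=10):
    return Counter((v // bin_width) * bin_width for v in values)

# Reduce step: merging partial histograms is the "simple reduction"
# that characterizes MRStat.
def reduce_histograms(a, b):
    return a + b

partitions = [[3, 12, 14, 27], [5, 11, 33], [8, 29, 31]]
histogram = reduce(reduce_histograms, (map_partition(p) for p in partitions))
print(dict(histogram))  # {0: 3, 10: 3, 20: 2, 30: 2}
```

Because the reduction is associative and each partial result is tiny, the pattern scales out trivially, which is why MRStat sits at the easy end of the classification.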

  6. Features of 51 Use Cases II
  • CF (4): Collaborative Filtering for recommender engines
  • LML (36): Local Machine Learning (independent for each parallel entity); an application could have GML as well
  • GML (23): Global Machine Learning: Deep Learning, Clustering, LDA, PLSI, MDS, and large-scale optimizations as in Variational Bayes, MCMC, Lifted Belief Propagation, Stochastic Gradient Descent, L-BFGS, Levenberg-Marquardt. Can call this EGO, or Exascale Global Optimization, with scalable parallel algorithms
  • Workflow (51): Universal
  • GIS (16): Geotagged data, often displayed in ESRI, Microsoft Virtual Earth, Google Earth, GeoServer, etc.
  • HPC (5): Classic large-scale simulation of cosmos, materials, etc., generating (visualization) data
  • Agent (2): Simulations of models of data-defined macroscopic entities represented as agents
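As a toy instance of the GML large-scale-optimization facet, here is a minimal stochastic gradient descent fit of a one-variable linear model in plain Python; the learning rate, epoch count, and synthetic data are illustrative choices, not from the talk:

```python
import random

# Minimal SGD for least squares: repeatedly take a gradient step on one
# (x, y) sample at a time, the pattern behind large-scale GML training.
def sgd_linear(data, lr=0.01, epochs=200, seed=0):
    random.seed(seed)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        random.shuffle(data)
        for x, y in data:
            err = (w * x + b) - y       # gradient of 0.5 * err**2
            w -= lr * err * x
            b -= lr * err
    return w, b

# Noiseless data from y = 2x + 1; SGD should recover w ~ 2, b ~ 1.
pts = [(x / 10, 2 * (x / 10) + 1) for x in range(20)]
w, b = sgd_linear(pts)
print(w, b)
```

The per-sample update is what makes SGD attractive at scale: a parallel version only needs to synchronize model parameters, not the data, which is the "Global" in Global Machine Learning.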

  7. Online Use Case Form: http://hpc-abds.org/kaleidoscope/survey/

  8. Data and Model in Big Data and Simulations. Problems combine Data and Model, but separating them gives insight and a better understanding of Big Data - Big Simulation convergence (or differences!).
  • Big Data implies the Data is large, but the Model varies (Judy Qiu's talk): e.g., LDA with many topics or deep learning has a large model, while clustering or dimension reduction can have quite small model size.
  • Simulations can also be considered as Data and Model: the Model is solving particle dynamics or partial differential equations; the Data can be small (just boundary conditions) or large, as with data assimilation (weather forecasting) or when data visualizations are produced by the simulation.
  • Data is often static between iterations (unless streaming); the Model varies between iterations.

  9. 7 Computational Giants of the NRC Massive Data Analysis Report, http://www.nap.edu/catalog.php?record_id=18374 (Big Data Models?)
  1) G1: Basic Statistics, e.g. MRStat
  2) G2: Generalized N-Body Problems
  3) G3: Graph-Theoretic Computations
  4) G4: Linear Algebraic Computations
  5) G5: Optimizations, e.g. Linear Programming
  6) G6: Integration, e.g. LDA and other GML
  7) G7: Alignment Problems, e.g. BLAST

  10. HPC (Simulation) Benchmark Classics (Simulation Models)
  • Linpack (HPL): parallel LU factorization for the solution of linear equations; also HPCG
  • NPB version 1, mainly classic HPC solver kernels: MG (Multigrid), CG (Conjugate Gradient), FT (Fast Fourier Transform), IS (Integer Sort), EP (Embarrassingly Parallel), BT (Block Tridiagonal), SP (Scalar Pentadiagonal), LU (Lower-Upper Symmetric Gauss-Seidel)
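As a reminder of what the CG kernel computes, here is an unoptimized conjugate-gradient solver in plain Python; the real NPB CG benchmark operates on a large sparse random matrix, while this sketch solves a tiny dense symmetric positive-definite system:

```python
# Conjugate gradient for Ax = b, A symmetric positive definite.
def cg(A, b, tol=1e-12, max_iter=100):
    n = len(b)
    x = [0.0] * n
    r = b[:]                          # residual r = b - Ax, with x = 0
    p = r[:]                          # initial search direction
    rs = sum(ri * ri for ri in r)
    for _ in range(max_iter):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(ri * ri for ri in r)
        if rs_new < tol:              # squared residual norm small enough
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]          # small SPD test matrix
b = [1.0, 2.0]
print(cg(A, b))                       # exact solution is (1/11, 7/11)
```

The benchmark stresses exactly the operations visible here: sparse (here dense) matrix-vector products and global dot-product reductions, which is why CG is a good proxy for communication patterns in iterative solvers.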

  11. 13 Berkeley Dwarfs (largely Models, for Data or Simulation)
  1) Dense Linear Algebra
  2) Sparse Linear Algebra
  3) Spectral Methods
  4) N-Body Methods
  5) Structured Grids
  6) Unstructured Grids
  7) MapReduce
  8) Combinational Logic
  9) Graph Traversal
  10) Dynamic Programming
  11) Backtrack and Branch-and-Bound
  12) Graphical Models
  13) Finite State Machines
The first 6 of these correspond to Colella's original list (classic simulations), with Monte Carlo dropped; N-body methods are a subset of Colella's Particle category. Note the list is a little inconsistent in that MapReduce is a programming model while spectral method is a numerical method. We need multiple facets to classify use cases!

  12. Classifying Use Cases

  13. Classifying Use Cases
  • Take the 51 NIST and other use cases and derive multiple specific features
  • Generalize and systematize, with the features termed facets
  • 50 facets (Big Data), termed Ogres, divided into 4 sets or views, where each view has similar facets
  • Adding simulations and looking separately at Data and Model gives 64 facets describing Big Simulation and Data, termed Convergence Diamonds, looking at either data or model or their combination
  • This allows one to study the coverage of benchmark sets and architectures

  14. 4 Ogre Views and 50 Facets (diagram). The four views and their facets: the Data Source and Style View (Geospatial Information System; HPC Simulations; Internet of Things; Metadata/Provenance; Shared/Dedicated/Transient/Permanent; Archived/Batched/Streaming; HDFS/Lustre/GPFS; Files/Objects; Enterprise Data Model; SQL/NoSQL/NewSQL); the Processing View (Optimization Methodology; Linear Algebra Kernels; Search/Query/Index; Micro-benchmarks; Recommendations; Graph Algorithms; Global Analytics; Local Analytics; Base Statistics; Classification; Visualization; Alignment; Streaming; Learning); the Execution View (Performance Metrics; Volume, Velocity, Variety, Veracity; Metric = M / Non-Metric = N; Dynamic = D / Static = S; Regular = R / Irregular = I communication structure; Data Abstraction; Flops per Byte and Memory I/O; Iterative/Simple; Execution Environment and Core Libraries); and the Problem Architecture View (Pleasingly Parallel; Classic MapReduce; Map-Collective; Map Point-to-Point; Map Streaming; Shared Memory; Single Program Multiple Data; Bulk Synchronous Parallel; Fusion; Dataflow; Agents; Workflow).

  15. 64 Features in 4 Views for Unified Classification of Big Data and Simulation Applications (diagram). The Convergence Diamonds keep the same four views, with each facet labeled D (data), M (model), or both: the Data Source and Style View (nearly all Data); the Processing View (all Model), which mixes Big Data processing facets (Optimization Methodology; Streaming Data Algorithms; Recommender Engine; Data Search/Query/Index; Base Data Statistics; Data Classification; Graph Algorithms; Data Alignment; Core Libraries; Visualization; Learning; Global and Local Analytics/Informatics/Simulations; Linear Algebra Kernels and many subclasses; Micro-benchmarks) with Simulation (Exascale) processing facets (Evolution of Discrete Systems; Nature of Mesh if Used; Iterative PDE Solvers; Particles and Fields; Multiscale Method; Spectral Methods; N-body Methods); the Execution View (a mix of Data and Model, with paired facets such as Data Volume and Model Size; Data Velocity; Data and Model Variety; Veracity; Metric = M / Non-Metric = N; Dynamic = D / Static = S and Regular = R / Irregular = I for both data and model; Data and Model Abstraction; Flops per Byte, Memory I/O, and Flops per Watt; Iterative/Simple; Performance Metrics; Execution Environment and Core Libraries); and the Problem Architecture View (nearly all Data+Model: Pleasingly Parallel; Classic MapReduce; Map-Collective; Map Point-to-Point; Map Streaming; Shared Memory; Single Program Multiple Data; Bulk Synchronous Parallel; Fusion; Dataflow; Agents; Workflow).

  16. Convergence Diamonds and their 4 Views I
  • One view is the overall problem architecture, or macropatterns, which is naturally related to the machine architecture needed to support the application. Unchanged from the Ogres, it describes properties of the problem such as Pleasingly Parallel or Uses Collective Communication.
  • The execution (computational) features, or micropatterns, view describes issues such as I/O versus compute rates, the iterative nature and regularity of the computation, and the classic V's of Big Data: defining problem size, rate of change, etc. Significant changes from the Ogres separate Data and Model and add characteristics of simulation models; e.g., both model and data have V's (Data Volume, Model Size), and an O(N^2) algorithm is relevant to a big data or a big simulation model.

  17. Convergence Diamonds and their 4 Views II
  • The data source & style view includes facets specifying how the data is collected, stored, and accessed. It has classic database characteristics, and simulations can have facets here describing input or output data. Examples: streaming, files versus objects, HDFS vs. Lustre.
  • The processing view has model (not data) facets which describe types of processing steps, including the nature of algorithms and kernels by model, e.g. Linear Programming, Learning, Maximum Likelihood, spectral methods, mesh type. It mixes the Big Data Processing View and the Big Simulation Processing View and includes some facets, like "uses linear algebra", needed in both; it has specifics of key simulation kernels and in particular includes facets seen in the NAS Parallel Benchmarks and the Berkeley Dwarfs.
  • Instances of Diamonds are particular problems, and a set of Diamond instances that covers enough of the facets could form a comprehensive benchmark/mini-app set. Diamonds and their instances can be atomic or composite.

  18. HPC-ABDS

  19. HPC-ABDS Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies. 21 layers, over 350 software packages (January 29, 2016); green indicates HPC work of NSF 14-43054.
Cross-cutting functions:
  1) Message and Data Protocols: Avro, Thrift, Protobuf
  2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups
  3) Security & Privacy: InCommon, Eduroam, OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML, OAuth
  4) Monitoring: Ambari, Ganglia, Nagios, Inca
Layers:
  5) IaaS Management from HPC to hypervisors: Xen, KVM, QEMU, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, rkt, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public clouds. Networking: Google Cloud DNS, Amazon Route 53
  6) DevOps: Docker (Machine, Swarm), Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Sahara, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive, OpenTOSCA, Winery, CloudML, Blueprints, Terraform, DevOpSlang, Any2Api
  7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis
  8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS. Public cloud: Amazon S3, Azure Blob, Google Cloud Storage
  9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs
  10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop, Pivotal GPLOAD/GPFDIST
  11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet
  11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, ZHT, Berkeley DB, Kyoto/Tokyo Cabinet, Tycoon, Tyrant, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, Pivotal Gemfire, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, graphdb, Yarcdata, AllegroGraph, Blazegraph, Facebook Tao, Titan:db, Jena, Sesame. Public cloud: Azure Table, Amazon Dynamo, Google DataStore
  11C) SQL (NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, CUBRID, Galera Cluster, SciDB, Rasdaman, Apache Derby, Pivotal Greenplum, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB, Spark SQL
  12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan, VoltDB, H-Store. Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC. Extraction tools: UIMA, Tika
  13) Inter-process communication (collectives, point-to-point, publish-subscribe): MPI, HPX-5, Argo BEAST, HPX-5 BEAST, PULSAR, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Marionette Collective. Public cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs
  14A) Basic programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, MR-MPI, Stratosphere (Apache Flink), Reef, Disco, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi, Galois, Medusa-GPU, MapGraph, Totem
  14B) Streams: Storm, S4, Samza, Granules, Neptune, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Twitter Heron, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics, Floe, Spark Streaming, Flink Streaming, DataTurbine
  15A) High level programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Pivotal HD/Hawq, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Kyoto Cabinet, Pig, Sawzall, Google Cloud DataFlow, Summingbird, Lumberyard
  15B) Application hosting frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere
  16) Application and Analytics: Mahout, MLlib, MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, OpenCV, Scalapack, PetSc, PLASMA, MAGMA, Azure Machine Learning, Google Prediction API & Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL (Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder (Intel), TinkerPop, Parasol, Dream:Lab, Google Fusion Tables, CINET, NWB, Elasticsearch, Kibana, Logstash, Graylog, Splunk, Tableau, D3.js, three.js, Potree, DC.js, TensorFlow, CNTK
  17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA), Jitterbit, Talend, Pentaho, Apatar, Docker Compose, KeystoneML

  20. Implementing HPC-ABDS
  • Building a high performance data analytics library in NSF 14-43054 Dibbs: SPIDAL building blocks (my next talk, Thursday)
  • Use C++, Python, or Java Grande as languages
  • Software philosophy: enhance existing ABDS, not standalone software; use Heron, Storm, Hadoop, Spark, Flink, HBase, Yarn, Mesos
  • Define the MPI community as the source of best-possible inter-process communication; need to enhance the MPI distribution, as HPC is mainly nearest-neighbor and big data mainly collectives
  • Spark, Flink, and Heron are the best distributed computing dataflow engines, differing in streaming support?
  • Judy Qiu will describe Harp as an HPC Hadoop plug-in
  • Working with Apache: how should one do this? Establish a standalone HPC project, or join existing Apache projects and contribute HPC enhancements?
  • Simple Apache experiment: with Twitter (Apache) Heron, build an HPC Heron that supports science use cases (big images), based on earlier work with Storm

  21. Functionality of the 21 HPC-ABDS Layers
  1) Message Protocols
  2) Distributed Coordination
  3) Security & Privacy
  4) Monitoring
  5) IaaS Management from HPC to hypervisors
  6) DevOps
  7) Interoperability
  8) File systems
  9) Cluster Resource Management
  10) Data Transport
  11) A) File management, B) NoSQL, C) SQL
  12) In-memory databases & caches / Object-relational mapping / Extraction tools
  13) Inter-process communication: collectives, point-to-point, publish-subscribe, MPI
  14) A) Basic programming model and runtime (SPMD, MapReduce), B) Streaming
  15) A) High level programming, B) Frameworks
  16) Application and Analytics
  17) Workflow-Orchestration

  22. Typical Big Data Pattern 2: Perform real-time analytics on data source streams and notify users when specified events occur. (Diagram: streamed data is fetched and archived to a streaming data repository; a user-specified filter identifies events in the stream and posts the selected events. Technologies: Storm (Heron), Kafka, HBase, Zookeeper.)
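The filter-and-notify flow of this pattern can be sketched in a few lines of plain Python; the generator and lists below are stand-ins for the real streaming infrastructure (a Kafka source, a Storm/Heron topology, an HBase archive):

```python
# Stand-in stream source: yields events as they "arrive".
def stream_source():
    for reading in [12, 3, 45, 7, 60, 2]:
        yield {"value": reading}

# The pattern: archive everything, apply the user-specified filter,
# and post (notify on) the selected events.
def run_pipeline(source, threshold=40):
    archive, alerts = [], []
    for event in source:
        archive.append(event)               # archive the raw stream
        if event["value"] > threshold:      # filter identifying events
            alerts.append(event)            # post selected events
    return archive, alerts

archive, alerts = run_pipeline(stream_source())
print(len(archive), [e["value"] for e in alerts])  # 6 [45, 60]
```

In a real deployment the loop body becomes a topology: the source is a Kafka consumer, the filter is a bolt, and the archive write goes to HBase, but the dataflow shape is the same.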

  23. Typical Big Data Pattern 5A: Perform interactive analytics on observational scientific data. (Diagram: scientific data is recorded in the field, locally accumulated with initial computing, then a batch of data is transported by direct transfer to the primary analysis data system; data storage uses HDFS, HBase, or file collections; analysis uses science analysis code, Mahout, R, and SPIDAL on grid or many-task software, Hadoop, Spark, Giraph, Pig. Streaming Twitter data for social networking is a variant. NIST examples include LHC, remote sensing, astronomy, and bioinformatics.)

  24. Improvement of Storm (Heron) using HPC communication algorithms. (Chart: improvement over serial shown for 3 algorithms.)

  25. HPC-ABDS Activities of NSF 14-43054
  • Level 17, Orchestration: Apache Beam (Google Cloud Dataflow)
  • Level 16, Applications: data mining for molecular dynamics, image processing for remote sensing and pathology, graphs, streaming, bioinformatics, social media, financial informatics, text mining
  • Level 16, Algorithms: generic and application specific; SPIDAL library
  • Level 14, Programming: Storm, Heron (Twitter replaces Storm), Hadoop, Spark, Flink; improve inter- and intra-node performance; science data structures
  • Level 13, Runtime Communication: enhanced Storm and Hadoop (Spark, Flink, Giraph) using HPC runtime technologies; Harp
  • Level 11, Data Management: HBase and MongoDB integrated via use of Beam and other Apache tools; enhance HBase
  • Level 9, Cluster Management: integrate Pilot Jobs with Yarn, Mesos, Spark, Hadoop; integrate Storm and Heron with Slurm
  • Level 6, DevOps: Python Cloudmesh virtual cluster interoperability

  26. Typical Convergence Architecture: run the same HPC-ABDS software across all platforms, but the data management machine has a different balance in I/O, network, and compute from the model machine. The model machine has similar issues whether used for Big Data or Big Simulation. (Diagram: clusters of data (D) and compute (C) nodes for data management and for the model.)

  27. Java Performance with Optimization: 128 24-core Haswell nodes running the SPIDAL DA-MDS code; speedup compared to 1 process per node on 48 nodes. (Chart compares: best MPI, inter- and intra-node; MPI inter/intra-node with Java not optimized; best threads intra-node with MPI inter-node. The HPC enhancement is roughly a factor of 10.)

  28. Converged Failure in the HPF Black Hole? Or where big data differs from simulations?
  • The database community looks at a big data job as a dataflow of (SQL) queries and filters. Apache projects like Pig, MRQL, and Flink (Volker Markl) aim at automatic query optimization by dynamic integration of queries and filters, including iteration and different data analytics functions.
  • Going back to ~1993, High Performance Fortran (HPF) compilers optimized sets of array and loop operations for large-scale parallel execution. HPF worked fine for initial simple, regular applications but ran into trouble for cases where parallelism is hard (irregular, dynamic).
  • Will the same happen in the Big Data world? It is straightforward to parallelize k-means clustering, but sophisticated algorithms like Elkan's method (using the triangle inequality) and fuzzy clustering are much harder (though not used much now).
  • Will Big Data technology run into HPF-style trouble with growing use of sophisticated data analytics?
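For reference, the "straightforward to parallelize" baseline is Lloyd's k-means, sketched here on 1-D points; every assignment-step distance computation is independent, which is what makes the naive algorithm easy to parallelize, unlike Elkan's variant, which keeps per-point distance bounds to skip computations. The sample data is illustrative:

```python
import random

# Lloyd's k-means: alternate an (embarrassingly parallel) assignment
# step with a (simple reduction) center-update step.
def kmeans(points, k, iters=20, seed=1):
    random.seed(seed)
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assignment: independent per point
            j = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[j].append(p)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]   # reduction per cluster
    return sorted(centers)

pts = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
print(kmeans(pts, 2))   # two well-separated groups near 1.0 and 10.0
```

Elkan's method breaks this clean map-plus-reduce structure: the triangle-inequality bounds introduce per-point state that must be maintained across iterations, which is exactly the kind of irregularity the slide warns about.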

  29. Constructing HPC-ABDS Exemplars. This is one of the next steps in the NIST Big Data Working Group.
  • Jobs are defined hierarchically as a combination of Ansible scripts (preferred over Chef or Puppet as it is Python); scripts are invoked on infrastructure (Cloudmesh tool)
  • INFO 524, the IU Data Science class "Big Data Open Source Software Projects", required the final project to be defined in Ansible, and a decent grade required that the script worked (on NSF Chameleon and FutureSystems)
  • 80 students gave 37 projects, with ~15 pretty good, such as "Machine Learning benchmarks on Hadoop with HiBench" (Hadoop/YARN, Spark, Mahout, HBase) and "Human and Face Detection from Video" (Hadoop, Spark, OpenCV, Mahout, MLlib)
  • Build up a curated collection of Ansible scripts defining use cases for benchmarking, standards, and education: https://docs.google.com/document/d/1OCPO2uqOkADvoxynRyZwh5IyFQ2_m1fkpBVMo3UBblg
  • The Fall 2015 class INFO 523, an introductory data science class, was less constrained; students just had to run a data science application. 140 students produced 45 projects (not required) using 91 technologies and 39 datasets.

  30. Cloudmesh Interoperability DevOps Tool
  • Model: define the software configuration with tools like Ansible (Chef, Puppet) and instantiate it on a virtual cluster. Save scripts, not virtual machines, and let the script build applications.
  • Cloudmesh is an easy-to-use command line program/shell and portal to interface with heterogeneous infrastructures, taking a script as input. It first defines a virtual cluster and then instantiates the script on it.
  • Supports OpenStack, AWS, Azure, SDSC Comet, VirtualBox, and libcloud-supported clouds, as well as classic HPC and Docker infrastructures. An abstraction layer makes it possible to integrate other IaaS frameworks, so managing VMs across different IaaS providers is easier.
  • Demonstrated interaction with various cloud providers: FutureSystems, Chameleon Cloud, Jetstream, CloudLab, Cybera, AWS, Azure, VirtualBox.
  • Status: AWS, Azure, VirtualBox, and Docker need improvements; we currently focus on SDSC Comet and NSF resources that use OpenStack.
HPC-Cloud Interoperability Layer
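The abstraction-layer idea can be sketched as a driver registry: one interface, multiple IaaS back ends. The class and method names below are purely illustrative and are not the actual Cloudmesh API:

```python
# Hypothetical sketch of an IaaS abstraction layer: callers name a
# cloud; the layer dispatches to the matching driver.
class Provider:
    def boot(self, name):
        raise NotImplementedError

class OpenStackProvider(Provider):
    def boot(self, name):
        return f"openstack VM {name} started"

class DockerProvider(Provider):
    def boot(self, name):
        return f"docker container {name} started"

# Registry mapping cloud names to drivers; adding a new IaaS framework
# means adding one entry here, not changing callers.
BACKENDS = {"openstack": OpenStackProvider, "docker": DockerProvider}

def launch(cloud, name):
    return BACKENDS[cloud]().boot(name)

print(launch("openstack", "node-1"))
print(launch("docker", "node-2"))
```

This is the design choice the slide describes: user scripts stay the same across FutureSystems, Chameleon, AWS, and the rest, because only the driver behind the interface changes.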

  31. Structure of Software Defined Big Data Exemplars
  • Github (Ansible Galaxy) collects basic Ansible roles; an exemplar (student project) may add specialized roles and defines a project Ansible playbook
  • The playbook is executed by a Cloudmesh cm script such as: cm launcher hibench parameterA=40 parameterB=xyz . cloud=chameleon
  • A typical playbook is short: include role python; include role hadoop; include role pig; include role fetch data; include role execute benchmark
  • The figure illustrates testing a new infrastructure or code change

  32. Components in Big Data HPC Convergence
  • Applications, benchmarks, and libraries: the 51 NIST Big Data use cases, the 7 Computational Giants of the NRC Massive Data Analysis report, the 13 Berkeley Dwarfs, and the 7 NAS Parallel Benchmarks. Unified discussion by separately discussing data and model for each application; the 64-facet Convergence Diamonds characterize applications. The characterization identifies hardware and software features for each application across big data and simulation, and a complete set of benchmarks (NIST).
  • Software architecture and its implementation: HPC-ABDS, Cloud-HPC interoperable software with the performance of HPC (High Performance Computing) and the rich functionality of the Apache Big Data Stack. Added HPC to Hadoop, Storm, Heron, and Spark; will add to Beam and Flink. Work in the Apache model, contributing code.
  • Run the same HPC-ABDS across all platforms, but data management nodes have a different balance in I/O, network, and compute from model nodes. Optimize for data and model functions as specified by the Convergence Diamonds; do not optimize separately for simulation and big data.
  • Convergence language: make C++, Java, Scala, and Python perform well.
