Scalable Data Science Middleware and Analytics Libraries for High Performance Computing

1 / 10

Embed Share

Explore the NSF Dibbs Award project focusing on scalable data science middleware and high-performance analytics libraries for various scientific fields such as biomolecular simulations, network science, epidemiology, computer vision, and spatial geographical information systems. The project involves collaboration among multiple institutions to develop middleware and analytics solutions aimed at enhancing scalability and performance in data-driven research across different domains.

jmohs Follow

Uploaded on Apr 04, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

NSF Dibbs Award 5 yr. Datanet: CIF21 DIBBs: Middleware and High Performance Analytics Libraries for Scalable Data Science IU(Fox, Qiu, Crandall, von Laszewski), Rutgers (Jha), Virginia Tech (Marathe), Kansas (Paden), Stony Brook (Wang), Arizona State(Beckstein), Utah(Cheatham) HPC-ABDS: Cloud-HPC interoperable software performance of HPC (High Performance Computing) and the rich functionality of the commodity Apache Big Data Stack. SPIDAL (Scalable Parallel Interoperable Data Analytics Library): Scalable Analytics for Biomolecular Simulations, Network and Computational Social Science, Epidemiology, Computer Vision, Spatial Geographical Information Systems, Remote Sensing for Polar Science and Pathology Informatics. 1

Year 1 Year 2 Years 3-5 Community requirement and technology evaluation (i) Arch and design spec (ii) In-memory pilot abstract., integrate with XSEDE SPIDAL-MIDAS Interface and SPIDAL V1.0 SPIDAL scheduling components and execution proceesing. MIDAS on Blue Waters. V1.0 release Integrated testing with Algorithms & MIDAS. Extend to V2.0 Scalability testing, adaptors for new platforms, Support for tools and developers, Optimization, Phase II of execution-processing models,V2.0 (i) Parallel Trajectory and MDAnalysis with MR (ii) iBIOMES data mgmt. in MIDAS (iii) End-to- end Integration of CPPTraj- MIDAS with SPIDAL (iv) Use SPIDAL Kmeans (v) Tutorials and outreach i) Algorithm implementation for subgraph problems ii) Develop new algorithms as necessary i) Implement the wrappers ii) Start implementing Giraph- based tool iii) Integrate EpiSimdemics and Epifast with SPIDAL (i) Implementation of 3D spatial queries. (ii) Application to 3D pathology (i) Continued implementation of 3D image processing library (ii) Application to liver and neuroblastoma (i) Continue implementing ML and global optimization; (ii) large-scale 3D recognition in social images Develop and implement (i) change detection and (ii) flow field estimation in satellite images. SPIDAL MIDAS Community requirements gathering CPPTRAJ to integrate with MIDAS for ensemble analysis on Blue Waters Community: HPC Biomolecular Simulations i) Gather community requirement ii) study existing network analytic algorithms i) Giraph-based clustering and community detection problems ii) Integ of CINET in SPIDAL Community: Network Science and Comp. Social Science Community requirement gathering Design i) Wrapper for EpiSimdemics and EpiFast ii) Giraph simulation tool Community: Computational Epidemiology (i) (ii) Community reqs Spatial queries library and 2D parallel (i) Implementation of 2D image preproc., segment and feature extraction and tumor research (i) (ii) spatial 2D clustering and Geospatial & pathology apps Image registration, object matching & feature extraction (3D) Integrate MIDAS Implement ML and optimization algorithms; large-scale image recognition (i) Develop and implement continent-scale layer finding Community: Spatial (i) Community: Pathology (ii) (i) Port image processing, feature extraction, image matching, pleasingly parallel ML algos Community: Computer vision: (ii) (i) single-echogram layer finding, tile matching Community: Radar informatics: 2 (ii)

Machine Learning in Network Science, Imaging in Computer Vision, Pathology, Polar Science, Biomolecular Simulations Algorithm Applications Features Status Parallelism Graph Analytics Community detection Social networks, webgraph P-DM GML-GrC Subgraph/motif finding Webgraph, biological/social networks P-DM GML-GrB Finding diameter Social networks, webgraph P-DM GML-GrB Graph . Clustering coefficient Social networks P-DM GML-GrC Page rank Webgraph P-DM GML-GrC Maximal cliques Social networks, webgraph P-DM GML-GrB Connected component Social networks, webgraph P-DM GML-GrB Betweenness centrality Social networks P-Shm Graph, static Non-metric, GML-GRA Shortest path Social networks, webgraph P-Shm Spatial Queries and Analytics Spatial queries relationship based P-DM PP GIS/social networks/pathology informatics Distance based queries P-DM PP Geometric Spatial clustering Seq GML Spatial modeling Seq PP GML Global (parallel) ML GrA Static GrB Runtime partitioning 3

Some specialized data analytics in SPIDAL Algorithm Applications Features Status Parallelism aa Core Image Processing Image preprocessing P-DM PP Object detection & segmentation P-DM PP Metric Space Point Sets, Neighborhood sets & Image features Image/object feature computation Computer vision/pathology informatics P-DM PP 3D image registration Seq PP Object matching Todo PP Geometric 3D feature extraction Todo PP Deep Learning Learning Network, Stochastic Gradient Descent Image Understanding, Language Translation, Voice Recognition, Car driving Connections in artificial neural net P-DM GML PP Pleasingly Parallel (Local ML) Seq Sequential Available GRA Good distributed algorithm needed Todo No prototype Available P-DM Distributed memory Available P-Shm Shared memory Available 4

Some Core Machine Learning Building Blocks Algorithm Applications Features Status //ism DA Vector Clustering DA Non metric Clustering Kmeans; Basic, Fuzzy and Elkan Levenberg-Marquardt Optimization Accurate Clusters Accurate Clusters, Biology, Web Non metric, O(N2) Vectors P-DM GML P-DM GML Fast Clustering Non-linear Gauss-Newton, use in MDS Vectors P-DM GML Least Squares P-DM GML Least O(N2) Vectors Squares, SMACOF Dimension Reduction DA- MDS with general weights P-DM GML Vector Dimension Reduction DA-GTM and Others Find nearest document corpus Find pairs of documents with TFIDF distance threshold P-DM GML neighbors in TFIDF Search P-DM PP Bag (image features) of words All-pairs similarity search below a Todo GML Support Vector Machine SVM Learn and Classify Vectors Seq GML Random Forest Gibbs sampling (MCMC) Latent Dirichlet Allocation LDA with Gibbs sampling or Var. Bayes Singular Value Decomposition SVD Learn and Classify Vectors P-DM PP Solve global inference problems Graph Todo GML Topic models (Latent factors) Bag of words P-DM GML Dimension Reduction and PCA Vectors Seq GML 5 Global inference on sequence models PP & GML Hidden Markov Models (HMM) Vectors Seq

Relevant DSC and XSEDE Computing Systems DSC adding128 node Haswell based (2 chips, 24 or 36 cores per node) system (Juliet) 128 GB memory per node Substantial conventional disk per node (8TB) plus PCI based 400 GB SSD Infiniband with SR-IOV Back end Lustre Older or Very Old (tired) machines India (128 nodes, 1024 cores), Bravo (16 nodes, 128 cores), Delta(16 nodes, 192 cores), Echo(16 nodes, 192 cores), Tempest (32 nodes, 768 cores); some with large memory, large disk and GPU Cray XT5m with 672 cores Optimized for Cloud research and Large scale Data analytics exploring storage models, algorithms Bare-metal v. Openstack virtual clusters Extensively used in Education XSEDE Wrangler and Comet likely to be especially useful 6

Kaleidoscope of (Apache) Big Data Stack (ABDS) and HPC Technologies 17) Workflow-Orchestration: ODE, ActiveBPEL, Airavata, Pegasus, Kepler, Swift, Taverna, Triana, Trident, BioKepler, Galaxy, IPython, Dryad, Naiad, Oozie, Tez, Google FlumeJava, Crunch, Cascading, Scalding, e-Science Central, Azure Data Factory, Google Cloud Dataflow, NiFi (NSA) 16) Application and Analytics: Mahout , MLlib , MLbase, DataFu, R, pbdR, Bioconductor, ImageJ, Scalapack, PetSc, Azure Machine Learning, Google Prediction API, Google Translation API, mlpy, scikit-learn, PyBrain, CompLearn, DAAL(Intel), Caffe, Torch, Theano, DL4j, H2O, IBM Watson, Oracle PGX, GraphLab, GraphX, IBM System G, GraphBuilder(Intel), TinkerPop, Google Fusion Tables, CINET, NWB, Elasticsearch 15B) Application Hosting Frameworks: Google App Engine, AppScale, Red Hat OpenShift, Heroku, Aerobatic, AWS Elastic Beanstalk, Azure, Cloud Foundry, Pivotal, IBM BlueMix, Ninefold, Jelastic, Stackato, appfog, CloudBees, Engine Yard, CloudControl, dotCloud, Dokku, OSGi, HUBzero, OODT, Agave, Atmosphere 15A) High level Programming: Kite, Hive, HCatalog, Tajo, Shark, Phoenix, Impala, MRQL, SAP HANA, HadoopDB, PolyBase, Presto, Google Dremel, Google BigQuery, Amazon Redshift, Drill, Pig, Sawzall, Google Cloud DataFlow, Summingbird 14B) Streams: Storm, S4, Samza, Granules, Google MillWheel, Amazon Kinesis, LinkedIn Databus, Facebook Puma/Ptail/Scribe/ODS, Azure Stream Analytics 14A) Basic Programming model and runtime, SPMD, MapReduce: Hadoop, Spark, Twister, Stratosphere (Apache Flink), Reef, Hama, Giraph, Pregel, Pegasus, Ligra, GraphChi 13) Inter process communication Collectives, point-to-point, publish-subscribe: MPI, Harp, Netty, ZeroMQ, ActiveMQ, RabbitMQ, NaradaBrokering, QPid, Kafka, Kestrel, JMS, AMQP, Stomp, MQTT, Public Cloud: Amazon SNS, Lambda, Google Pub Sub, Azure Queues, Event Hubs 12) In-memory databases/caches: Gora (general object from NoSQL), Memcached, Redis, LMDB (key value), Hazelcast, Ehcache, Infinispan 12) Object-relational mapping: Hibernate, OpenJPA, EclipseLink, DataNucleus, ODBC/JDBC 12) Extraction Tools: UIMA, Tika 11C) SQL(NewSQL): Oracle, DB2, SQL Server, SQLite, MySQL, PostgreSQL, Galera Cluster, SciDB, Rasdaman, Apache Derby, Google Cloud SQL, Azure SQL, Amazon RDS, Google F1, IBM dashDB, N1QL, BlinkDB 11B) NoSQL: Lucene, Solr, Solandra, Voldemort, Riak, Berkeley DB, MongoDB, Espresso, CouchDB, Couchbase, IBM Cloudant, HBase, Google Bigtable, LevelDB, Megastore and Spanner, Accumulo, Cassandra, RYA, Sqrrl, Neo4J, Yarcdata, AllegroGraph, Facebook Tao, Titan:db, Jena, Sesame Public Cloud: Azure Table, Amazon Dynamo, Google DataStore 11A) File management: iRODS, NetCDF, CDF, HDF, OPeNDAP, FITS, RCFile, ORC, Parquet Big Data Software Model Cutting Functions 1) Message and Data Protocols: Avro, Thrift, Protobuf 2) Distributed Coordination: Google Chubby, Zookeeper, Giraffe, JGroups 3) Security & Privacy: InCommon, Eduroam OpenStack Keystone, LDAP, Sentry, Sqrrl, OpenID, SAML OAuth 4) Monitoring: Ambari, Ganglia, Nagios, Inca 10) Data Transport: BitTorrent, HTTP, FTP, SSH, Globus Online (GridFTP), Flume, Sqoop 9) Cluster Resource Management: Mesos, Yarn, Helix, Llama, Google Omega, Facebook Corona, Celery, HTCondor, SGE, OpenPBS, Moab, Slurm, Torque, Globus Tools, Pilot Jobs 8) File systems: HDFS, Swift, Haystack, f4, Cinder, Ceph, FUSE, Gluster, Lustre, GPFS, GFFS Public Cloud: Amazon S3, Azure Blob, Google Cloud Storage 7) Interoperability: Libvirt, Libcloud, JClouds, TOSCA, OCCI, CDMI, Whirr, Saga, Genesis 6) DevOps: Docker, Puppet, Chef, Ansible, SaltStack, Boto, Cobbler, Xcat, Razor, CloudMesh, Juju, Foreman, OpenStack Heat, Rocks, Cisco Intelligent Automation for Cloud, Ubuntu MaaS, Facebook Tupperware, AWS OpsWorks, OpenStack Ironic, Google Kubernetes, Buildstep, Gitreceive 5) IaaS Management from HPC to hypervisors: Xen, KVM, Hyper-V, VirtualBox, OpenVZ, LXC, Linux-Vserver, OpenStack, OpenNebula, Eucalyptus, Nimbus, CloudStack, CoreOS, VMware ESXi, vSphere and vCloud, Amazon, Azure, Google and other public Clouds, Networking: Google Cloud DNS, Amazon Route 53 2015 Cross- 21 layers Over 300 Software Packages 20 March 7

HPC ABDS SYSTEM (Middleware) >~ 266 Software Projects System Abstraction/Standards Data Format and Storage HPC Yarn for Resource management Horizontally scalable parallel programming model Collective and Point to Point Communication Support for iteration (in memory processing) HPC ABDS Hourglass Application Abstractions/Standards Graphs, Networks, Images, Geospatial .. Scalable Parallel Interoperable Data Analytics Library (SPIDAL) High performance Mahout, R, Matlab .. 9 High Performance Applications

Applications SPIDAL MIDAS ABDS Community & Examples Govt. Operations Commercial Defense Healthcare, Life Science Deep Learning, Social Media Research Ecosystems Astronomy, Physics Earth, Env., Polar Science Energy (Inter)disciplinary Workflow SPIDAL Analytics Libraries Programmin g & Runtime Models MIDAS Native ABDS SQL-engines, Storm, Impala, Hive, Shark MIddleware for Data-Intensive Analytics and Science (MIDAS) API HPC-ABDS MapReduce Native HPC MPI Map Only, PP Many Task Classic MapReduce Map Collective Map Point to Point, Graph Communication Data Systems and Abstractions (In-Memory; HBase, Object Stores, other NoSQL stores, Spatial, SQL, Files) (MPI, RDMA, Hadoop Shuffle/Reduce, HARP Collectives, Giraph point-to-point) Higher-Level Workload Management (Tez, Llama) Workload Management (Pilots, Condor) Framework specific Scheduling (e.g. YARN) Resource Fabric 10 External Data Access (Virtual Filesystem, GridFTP, SRM, SSH) Cluster Resource Manager (YARN, Mesos, SLURM, Torque, SGE) Compute, Storage and Data Resources(Nodes, Cores, Lustre, HDFS)

Scalable Data Science Middleware and Analytics Libraries for High Performance Computing

Download Presentation

Presentation Transcript

Related

More Related Content