Scientific Workflows on Amazon Web Services Grid and Cloud

Explore how on-demand computing on Amazon Web Services facilitates scientific workflows at Fermilab, optimizing experiments with higher intensity and precision. Learn about access to resources, moving software and data to the cloud, and the hybrid cloud solution. This presentation showcases the innovative use of commercial cloud services for scientific advancements.

  • Scientific Workflows
  • Amazon Web Services
  • Fermilab
  • Cloud Computing
  • Hybrid Cloud

Presentation Transcript


  1. On-demand Computing for Scientific Workflows using Amazon Web Services. Grid and Cloud Services Department, Fermilab. Claudio Pontili. Supervisors: Gabriele Garzoglio, Steven Timm.

  2. Presentation Outline
     • Introduction: why do we want to use Commercial Cloud?
     • GlideinWMS: a simple way to access resources
     • Test: running up to 1000 simultaneous jobs on Amazon Web Services and FermiCloud
     • Task A: moving software and data to Commercial Cloud
     • Task B: adding HTCondor to and removing it from GlideinWMS using AWS
     • Task C: hybrid Cloud solution, AWS + FermiCloud
     • Conclusions

     Claudio Pontili | AWS On-Demand 03/04/2025

  3. Introduction: Fermilab Experiment Schedule FY16
     • Measurements at all frontiers: electroweak physics, neutrino oscillations, muon g-2, dark energy, dark matter
     • 8 major experiments in 3 frontiers running simultaneously in 2016
     • Sharing both beam and computing resources
     • Impressive breadth of experiments at FNAL

  4. Introduction: Computing requirements for experiments
     Higher-intensity and higher-precision measurements are driving requests for more computing resources than previous small experiments needed:
     • Beam simulations to optimize experiments (make every particle count)
     • Detector design studies (cost effectiveness and sensitivity projections)
     • Higher-bandwidth DAQ and greater detector granularity
     • Event generation and detector response simulation
     • Reconstruction and analysis algorithms

  5. Introduction: Slots at Fermilab, OSG, and Clouds
     • Current full capacity of FermiGrid: ~30k slots
     • Full capacity of OSG (Open Science Grid): ~85k slots
     • Additional OSG opportunistic slots: 15k to 30k
     • Additional per-pay slots at commercial Clouds

  6. Federation via GlideinWMS: Grid and Cloud Bursting
     • Unified submit tool for grid and cloud using HTCondor
     • [Diagram labels: Users, Jobsub_submit, Job Queue, FrontEnd, GlideinWMS Pilot Factory, Jobs, Pilots; resources: FermiGrid, FermiCloud, Gcloud, KISTI, Amazon AWS, OSG]

  7. Running on AWS: NOvA jobs as a function of time, Oct 23, 2014
     • 3300 jobs
     • 525 m3.large
     • Total cost: $449
     • Prices are coming down
     • [Figure: number of jobs (0 to 1200) versus time (2:21 to 7:50); scale bar: 1 hour]
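The figures on this slide allow a quick back-of-the-envelope unit cost. The sketch below is only arithmetic on the three numbers quoted in the transcript. It assumes that "525 m3.large" means 525 instances of that type, which the transcript does not state explicitly.

```python
# Numbers quoted on the slide; the meaning of 525 is an assumption
# (taken here as the count of m3.large instances used in the test).
jobs = 3300
instances = 525
total_cost_usd = 449.0

# Average cost of one job and of one instance over the whole test.
cost_per_job = total_cost_usd / jobs
cost_per_instance = total_cost_usd / instances

print(f"cost per job:      ${cost_per_job:.3f}")       # $0.136
print(f"cost per instance: ${cost_per_instance:.3f}")  # $0.855
```

These are averages over the whole run and say nothing about how long individual jobs or instances ran.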

  8. Task A: Moving software and data to the commercial cloud using scalable Squid servers
     • Need to transport software and data to the Cloud
     • Auto-scalable Squid servers, deployed and destroyed in 30 seconds using a CloudFormation script

  9. Task B: Auto-scaling GlideinWMS by adding/removing HTCondor using Amazon Web Services
     • New resources are made available through the WMS (HTCondor)
     • The system is designed to scale by adding servers
     • Problem: the submission system is a stateful service, so it is easy to scale up but hard to scale down
     • Solution: manage the lifecycle of each server using AWS lifecycle hooks and Standby (released July 30th 2014)
     • [State diagram: Pending, Pending:Wait (lifecycle hook), InService, Standby, Terminating:Wait (lifecycle hook), Terminated]
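The lifecycle states named on this slide can be read as a small state machine. The sketch below is a simplified, in-memory model and makes no AWS calls. The transition table is one reading of the slide's diagram, not an authoritative copy of the Auto Scaling lifecycle (the real service has further intermediate states), and the `Instance` class and the instance id are hypothetical.

```python
# Simplified lifecycle model using only the states named on the slide.
# ASSUMPTION: this transition table is an interpretation of the slide's
# state diagram; the real AWS lifecycle has additional intermediate states.
ALLOWED = {
    "Pending": {"Pending:Wait"},
    "Pending:Wait": {"InService"},
    "InService": {"Standby", "Terminating:Wait"},
    "Standby": {"InService", "Terminating:Wait"},
    "Terminating:Wait": {"Terminated"},
    "Terminated": set(),
}


class Instance:
    """Hypothetical stand-in for one server in the auto scaling group."""

    def __init__(self, instance_id):
        self.instance_id = instance_id
        self.state = "Pending"
        self.history = ["Pending"]

    def move_to(self, new_state):
        # Refuse any transition that the table above does not allow.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)


if __name__ == "__main__":
    vm = Instance("i-hypothetical")
    for state in ("Pending:Wait", "InService", "Standby", "InService"):
        vm.move_to(state)
    print(" -> ".join(vm.history))
```

The two `:Wait` states are where a lifecycle hook pauses the instance, which gives a custom action time to run before the instance enters or leaves service.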

  10. Task B: Auto-Scaling HTCondor using AWS - 2
      • Scale-down event, custom action: find the right VM to terminate and change its ASG state to standby. The instance is removed from the Auto Scaling Group and ELB; the standby instance will be terminated after 5 days.
      • Scale-up event, custom action: look for a standby instance instead of creating a new one. The instance is attached to the Auto Scaling Group and ELB.
      • Auto Scaling Group and ELB controlled by custom metrics (idle and running jobs)
      • [Diagram components: hook queue, queue, SNS, lifecycle hook, Java custom actions, instance launched, permissions, role-based authentication, AWS CLI, S3 logs and init scripts]
      • At least 7 different Amazon services
      • 2 different programming languages (Java and bash scripting)
      • Role-based authentication: auto-generating and auto-rotating logins and passwords
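The two custom actions on this slide can be sketched as plain functions over an in-memory pool. This is an illustration of the idea only: the original was written in Java and bash against AWS services, whereas the `Node` type, the "idle means zero running jobs" rule, and the `launch` callback below are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical record for one HTCondor submit server."""
    node_id: str
    state: str = "InService"  # "InService" or "Standby"
    running_jobs: int = 0


def scale_down(pool):
    """Park one idle in-service node in Standby; return it, or None.

    ASSUMPTION: "the right VM to terminate" is taken to mean a node with
    no running jobs, since the submission service is stateful.
    """
    for node in pool:
        if node.state == "InService" and node.running_jobs == 0:
            node.state = "Standby"
            return node
    return None  # every node still has work, so nothing is removed


def scale_up(pool, launch):
    """Reuse a standby node if one exists, otherwise launch a new one."""
    for node in pool:
        if node.state == "Standby":
            node.state = "InService"
            return node
    node = launch()  # caller-supplied factory standing in for a real launch
    pool.append(node)
    return node


if __name__ == "__main__":
    pool = [Node("a", running_jobs=4), Node("b")]
    parked = scale_down(pool)
    print("parked:", parked.node_id)
    reused = scale_up(pool, launch=lambda: Node("c"))
    print("reused:", reused.node_id, "pool size:", len(pool))
```

Reusing a standby node avoids paying the start-up time of a fresh instance. The 5-day termination of standby instances mentioned on the slide is not modelled here.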

  11. Task C: Hybrid cloud with FermiCloud and Amazon Web Services
      • No change for the end user, who just wants to submit a job in the same old way and have it computed as fast as possible
      • Now we can handle spikes of traffic using the commercial cloud
      • We pay AWS only during spikes
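The "pay only during spikes" idea can be illustrated with a toy placement rule: fill local slots first and send only the overflow to the commercial cloud. The function name, slot count, and demand series below are hypothetical, and the real placement decision is made by GlideinWMS rather than by code like this.

```python
def burst_plan(demand_per_hour, local_slots):
    """Split each hour's job demand into (local, cloud) slot counts.

    ASSUMPTION: local capacity is always used first and is fixed over time.
    """
    plan = []
    for demand in demand_per_hour:
        local = min(demand, local_slots)
        plan.append((local, demand - local))  # overflow goes to the cloud
    return plan


if __name__ == "__main__":
    # Hypothetical demand with a single spike in the second hour.
    for hour, (local, cloud) in enumerate(burst_plan([10, 50, 20], local_slots=30)):
        print(f"hour {hour}: local={local} cloud={cloud}")
```

In this example cloud slots are used only in the hour where demand exceeds the 30 local slots.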

  12. Conclusions
      • Experiments have an increased need for computing resources, with an increased diversity of requirements
      • Managing these needs (especially peak demand) is a major focus
      • Experiments are being enabled to use a diverse set of resources: Local, Grid, and Cloud
      • Need to demonstrate the sustainability and cost effectiveness of the commercial Cloud solution for physics use cases (e.g. NOvA Monte Carlo)
      • This work demonstrated the scaling of on-demand services in support of scientific workflows using native Amazon services

  13. Thank you. Questions?
