
Leadership Computing Facilities: Infrastructure Discussion at ACME All Hands Meeting
"Explore the discussions on machines, availability, turnaround times, and allocation policies at the Leadership Computing Facilities infrastructure meeting for ACME. Learn about the computational resources, priorities, and future strategies discussed during the event."
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript

Leadership Computing Facilities (LCF) Infrastructure Discussion
Mark Taylor
ACME All Hands Meeting, June 9, 2016, Rockville

Discussion Topics
- Machines
- Availability
- Turnaround
- Low-res and high-res runs
- Problems / Solutions / Strategies

Computational Resources
- Search "computational resources" on Confluence
- Overview of ACME machines (Edison/Titan/Mira)
- Overview of ACME allocations (INCITE/ALCC/ERCAP)
- Instructions for getting an account
- Allocation prioritization policy

Mira
- 49K nodes
- 8K nodes minimum (512K cores) needed for priority, throughput, and access to the 24h queue
- ACME v0 high-res can use 2K nodes; bundle 4-member ensembles (see the sketch below)
- 4 x 80 years in CY15 and CY16 at ~1M core-hours per simulated year
- Performance group is working on a v1 high-res configuration that can scale to 8K nodes
- Allocation: INCITE CY16: 100M core-hours; ALCC (through mid-2017): 158M core-hours
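
A rough sanity check of the bundling strategy and the campaign cost, using only figures quoted on this slide (2K nodes per v0 high-res member, the 8K-node priority threshold, 4 x 80 simulated years at ~1M core-hours per simulated year); this is an illustrative sketch, not part of the original presentation.

```python
# Illustrative arithmetic for the Mira slide (figures taken from the slide above).
NODES_PER_MEMBER = 2048      # ACME v0 high-res job size on Mira
PRIORITY_THRESHOLD = 8192    # minimum nodes for priority and the 24h queue

# How many ensemble members must be bundled into one job to reach the threshold?
members = -(-PRIORITY_THRESHOLD // NODES_PER_MEMBER)   # ceiling division
print(f"bundle {members} members")                     # -> bundle 4 members

# Approximate cost of the CY15-16 campaign: 4 members x 80 simulated years each.
COST_PER_SIM_YEAR = 1.0e6                              # ~1M core-hours per simulated year
campaign = 4 * 80 * COST_PER_SIM_YEAR
print(f"campaign cost ~{campaign / 1e6:.0f}M core-hours")  # -> ~320M core-hours
```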

Titan
- 19K nodes
- 3750 nodes minimum (60K cores) needed for priority, throughput, and access to the 24h queue
- ACME v0 high-res works well
- In CY15 we used our entire allocation in 3 months running the v0 pre-industrial simulation at ~1.8M core-hours per simulated year (see the sketch below)
- Allocation: INCITE CY16: 80M core-hours; ALCC (through mid-2017): 53M core-hours
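
As a point of reference for how far the Titan allocation stretches at the quoted cost, a minimal sketch using the 80M core-hour CY16 INCITE figure and ~1.8M core-hours per simulated year from this slide; the resulting year count is derived here, not stated in the presentation.

```python
# How many simulated years the CY16 INCITE allocation buys at the quoted v0 high-res cost.
ALLOCATION = 80e6            # INCITE CY16 core-hours on Titan
COST_PER_SIM_YEAR = 1.8e6    # core-hours per simulated year, v0 high-res

print(f"~{ALLOCATION / COST_PER_SIM_YEAR:.0f} simulated years")  # -> ~44 simulated years
```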

NERSC: Edison/Cori
- 5600 nodes; supports small and large jobs
- Only machine appropriate for low-res runs
- ACME v1 low-res: 150-300 node jobs are efficient, but with long queue wait times (48h)
- Can use the premium queue, which cuts the wait to ~24h but costs 2x
- The premium benefit is also available for ~600-node jobs
- Working with NERSC to get a semi-realtime queue for 150-node jobs, 12h, run overnight (~15M core-hours per year; see the sketch below)
- Allocation: ERCAP CY16: 105M core-hours (73M used to date)
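
The ~15M core-hours/year figure for the proposed semi-realtime queue follows from simple arithmetic; a sketch assuming 24 cores per Edison node (the 150-node, 12-hour, nightly parameters come from the slide; the cores-per-node value is an assumption about Edison's configuration).

```python
# Annual charge of a nightly 150-node, 12-hour semi-realtime window on Edison.
CORES_PER_NODE = 24      # assumed Edison node: two 12-core sockets
NODES = 150
HOURS_PER_NIGHT = 12
NIGHTS_PER_YEAR = 365

core_hours = CORES_PER_NODE * NODES * HOURS_PER_NIGHT * NIGHTS_PER_YEAR
print(f"~{core_hours / 1e6:.1f}M core-hours per year")  # -> ~15.8M, consistent with the ~15M quoted
```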

NERSC: Cori Phase 2
- Available in October; free through NESAP early access if we can run!
- Eventually expect performance slightly faster than Edison using the same number of nodes
- Initial performance could be much slower as we work out the details of porting to and using KNL

ACME Condo Computing
- Considering condo-computing models at NERSC, PNNL, ANL, and ORNL
- 150 nodes, equivalent to ~30M core-hours per year on Edison (see the sketch below)
- Delivery would be ~late 2016
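
The 30M core-hour equivalence is roughly what 150 Edison-class nodes deliver if kept busy year-round; a minimal sketch, again assuming 24 cores per node.

```python
# Year-round capacity of a 150-node condo partition, in Edison core-hour terms.
CORES_PER_NODE = 24      # assumed Edison-class node
NODES = 150
HOURS_PER_YEAR = 24 * 365

capacity = NODES * CORES_PER_NODE * HOURS_PER_YEAR
print(f"~{capacity / 1e6:.1f}M core-hours per year")  # -> ~31.5M at full utilization
```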

Comments
- The NERSC semi-realtime queue could provide 2-5 simulated years overnight, every day (see the sketch below)
- Remaining runs need to adopt various queue strategies:
  - Bundle 2 runs to get into the 600-node queue
  - Improve ACME v1 scaling so 1 run can use 600 nodes
  - Submit many runs at once for sets of runs that can be done independently
- Real world: many reliability problems at NERSC (random slowdowns, hangs, or crashes)
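
The "2-5 simulated years overnight" claim implies a throughput range for v1 low-res in a 12-hour window; the conversion below to simulated years per wall-clock day (SYPD) is a derived illustration, not a figure from the presentation.

```python
# Implied throughput if a 12-hour nightly window yields 2-5 simulated years.
WINDOW_HOURS = 12
for sim_years in (2, 5):
    sypd = sim_years * 24 / WINDOW_HOURS   # simulated years per wall-clock day
    print(f"{sim_years} years per night -> {sypd:.0f} SYPD")
# -> 2 years per night -> 4 SYPD
# -> 5 years per night -> 10 SYPD
```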