
Innovations in Data Storage for Large-Scale Research Projects
Explore the latest advancements in data storage technology for large-scale research projects as discussed by Alastair Dewhurst. Topics include disk storage, tape storage, storage costs, storage software, and the concept of data lakes. Learn about the shift towards SSDs, the resilience of tape storage, managing storage software, and the development of data lakes by the WLCG.
Presentation Transcript
Data Storage Alastair Dewhurst
2 Introduction
As has almost certainly already been shown, the LHC models predict significant growth in data. Even with aggressive R&D this will be at the edge of what Moore's law can provide. We can learn a lot from industry: Amazon, Google, etc. already store many exabytes.
[Figure: 2020 Computing Model - Disk (ATLAS Preliminary). Disk storage [EB] versus year, 2020 to 2034, covering Run 3 (μ=55), Run 4 (μ=88-140) and Run 5 (μ=165-200). Legend: Baseline; Conservative R&D; Aggressive R&D; Sustained budget model (+10% +20% capacity/year); LHCC common scenario (Conservative R&D, μ=200).]
[Figure: 2020 Computing Model - Tape (ATLAS Preliminary). Tape storage [EB] versus year, 2020 to 2034, same runs. Legend: Tier-1 Baseline; Tier-1 Conservative R&D; Tier-1 Aggressive R&D; Sustained budget model (+10% +20% capacity/year); LHCC common scenario (Conservative R&D, μ=200).]
Alastair Dewhurst, 19th July 2021
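To give a feel for the "sustained budget model" band in the plots, the short sketch below compounds a fixed annual capacity gain over the plotted period. It assumes the legend's "+10% +20% capacity/year" means a flat budget buys 10% to 20% more capacity each year, and it takes the 14-year span from the 2020 to 2034 axis; the function name is illustrative and not taken from the talk.

```python
def capacity_multiplier(annual_growth: float, years: int) -> float:
    """Compound growth factor after `years` at a fixed annual growth rate."""
    return (1.0 + annual_growth) ** years


# Assumption: 14 years, read off the 2020-2034 axis of the plots.
YEARS = 14

for rate in (0.10, 0.20):
    factor = capacity_multiplier(rate, YEARS)
    print(f"+{rate:.0%}/year for {YEARS} years -> capacity x{factor:.1f}")
```

Under these assumptions a flat budget buys roughly 4 to 13 times the 2020 capacity by 2034, which is the envelope the experiment requirement curves have to be compared against.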
3 Disk Roadmap
SSDs are taking over in the consumer world and for data-intensive workflows. The focus is on performance rather than capacity. HDDs remain critical for data centre use cases. Data does not, in general, get deleted, so larger fractions are becoming cold. There is a clear roadmap for higher-capacity HDDs.
4 Tape Roadmap
Oracle unexpectedly pulled out of the market in 2017. Tape has been declared dead many times, but development continues at a rapid rate. Tape has a few strong selling points:
- No power costs to store data.
- Tape media lasts a long time (~30 years).
- New: the air gap means it is immune to ransomware attacks!
In December 2020, IBM demonstrated a 580 TB tape.
http://www.insic.org/wp-content/uploads/2019/07/INSIC-Technology-Roadmap-2019.pdf
5 Storage costs
This year I was quoted ~10 times the cost for SSD compared to HDD storage. Tape is ~1/3 the cost of HDD. I believe all 3 technologies will be vital for HL-LHC.
https://aip.scitation.org/doi/10.1063/1.5130404
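The quoted ratios can be turned into a rough blended cost per unit of capacity. The sketch below normalises HDD to 1.0 and uses the approximate figures from the slide (SSD about 10 times HDD, tape about one third of HDD). The tier fractions in `example_mix` are purely hypothetical and are not taken from the talk or from any site's real deployment.

```python
from typing import Mapping

# Relative cost per unit capacity, HDD = 1.0 (approximate ratios from the slide).
RELATIVE_COST = {"ssd": 10.0, "hdd": 1.0, "tape": 1.0 / 3.0}


def blended_cost(mix: Mapping[str, float]) -> float:
    """Average relative cost per unit capacity for a mix of storage tiers.

    `mix` maps a tier name to the fraction of total capacity held on it.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("tier fractions must sum to 1")
    return sum(RELATIVE_COST[tier] * fraction for tier, fraction in mix.items())


# Hypothetical mix: small SSD cache, moderate HDD pool, large tape archive.
example_mix = {"ssd": 0.05, "hdd": 0.35, "tape": 0.60}
print(f"blended cost relative to all-HDD: {blended_cost(example_mix):.2f}")
```

With this hypothetical mix, even a 5% SSD share accounts for about half of the blended cost, which illustrates why SSD is best reserved for small caches.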
6 Storage software
Two things to consider:
- Managing the underlying storage
- Managing the middleware
Grid tools were designed in a different era. Slowly, the WLCG is replacing bespoke Grid software with industry-standard tools.
[Diagram labels: Grid Layer; Industry Standard API; "A collection of storage servers managed by site admins and HEP community scripts."; "Large scale industry standard storage endpoints to manage hardware."]
7 Data Lakes
The WLCG is working on a Data Lake concept. I would expect fewer, larger-scale storage endpoints. Data could be accessed directly from these endpoints. Numerous small, performant caches (SSD) would serve data processing. Experiments and resource providers need to work together to get data in the right place before it is needed.
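The idea of a small cache in front of a large endpoint can be sketched as a least-recently-used (LRU) read-through cache. This is only an illustration of the concept: the class name `SiteCache`, the `fetch_from_lake` callback and the choice of LRU eviction are assumptions made for the example, not the WLCG design, and capacity is counted in files rather than bytes to keep it short.

```python
from collections import OrderedDict
from typing import Callable


class SiteCache:
    """Hypothetical read-through LRU cache in front of a data lake endpoint."""

    def __init__(self, capacity: int, fetch_from_lake: Callable[[str], bytes]):
        self.capacity = capacity                # max number of cached files
        self.fetch_from_lake = fetch_from_lake  # slow path to the endpoint
        self._files: "OrderedDict[str, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, name: str) -> bytes:
        if name in self._files:
            self._files.move_to_end(name)       # mark as most recently used
            self.hits += 1
            return self._files[name]
        self.misses += 1
        data = self.fetch_from_lake(name)       # fetch from the large endpoint
        self._files[name] = data
        if len(self._files) > self.capacity:
            self._files.popitem(last=False)     # evict least recently used
        return data


# Usage: a dict stands in for the remote storage endpoint.
lake = {"a.root": b"A", "b.root": b"B", "c.root": b"C"}
cache = SiteCache(capacity=2, fetch_from_lake=lake.__getitem__)
for name in ["a.root", "b.root", "a.root", "c.root", "b.root"]:
    cache.read(name)
print(f"hits={cache.hits} misses={cache.misses}")
```

Every miss is a trip to the remote endpoint, so placing data before a job asks for it amounts to warming this cache ahead of time.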
8 UK Strengths
The GridPP project has been in existence for 20 years. It is seen as a very reliable partner from an operations point of view. We have many people in leadership positions within the WLCG community. The UK has pioneered the use of erasure coding for data storage. The UK has additional funding from EGI and Swift-HEP to run and develop FTS, Rucio and CVMFS services.
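For readers unfamiliar with erasure coding, the sketch below shows the simplest possible case: data is split into equal-length chunks plus one XOR parity chunk, so any single lost chunk can be rebuilt from the survivors. This is a toy illustration only. Production systems typically use codes with several parity chunks, and nothing here is meant to describe the scheme used at any UK site; the function names are made up for the example.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(chunks: List[bytes]) -> bytes:
    """Return a single XOR parity chunk for equal-length data chunks."""
    if len({len(c) for c in chunks}) != 1:
        raise ValueError("all chunks must have the same length")
    return reduce(xor_bytes, chunks)


def recover(chunks: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one missing chunk (marked None) using the parity."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair at most one lost chunk")
    repaired = list(chunks)
    if missing:
        survivors = [c for c in chunks if c is not None]
        # XOR of the parity with all surviving chunks yields the lost chunk.
        repaired[missing[0]] = reduce(xor_bytes, survivors, parity)
    return repaired


data = [b"abcd", b"efgh", b"ijkl"]
parity = encode(data)
restored = recover([data[0], None, data[2]], parity)
print(restored == data)
```

Compared with keeping full replicas, this stores only one extra chunk for three data chunks while still surviving the loss of any one of them.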