Data I/O at Edge Sites for Traditional Experiments in a Distributed Environment
A lightweight distributed computing platform for traditional experiments faces challenges in data management, user authentication, and I/O efficiency between edge sites and IHEP. The solutions include a mimic file system, unified authentication, and asynchronous data write-back.
Presentation Transcript
Data I/O at Edge Sites for Traditional Experiments in a Distributed Environment
Lu Wang (wanglu@ihep.ac.cn), on behalf of Computing Center, IHEP
2022-Mar-25
Outline
1. Introduction
2. Design and Initial Implementations
3. Test Results
4. Summary & Next Steps
One Platform, Multi-Centers (1/2)
"One Platform, Multi-Centers" is a lightweight distributed computing platform.
- Computing resources are contributed by IHEP and its collaborating institutes.
- Both the IHEP site and the edge sites are maintained by IHEP-CC staff.
On edge sites (before):
- User accounts, file system namespaces, and job schedulers were isolated from IHEP-CC.
On edge sites (after):
- Job slots are shared by local and pilot jobs.
- Both local and distributed users are authenticated.
- Jobs from IHEP can be executed transparently.
One Platform, Multi-Centers (2/2)
- Since 2020, the resources of the edge sites have grown to a significant scale: the total number of CPU cores at the edge sites now exceeds that at IHEP.
- Storage on the edge sites is limited; most experiments keep their only complete data replica at IHEP.
- Effective, transparent data I/O from the edge sites to IHEP is therefore crucial for overall efficiency.
Motivations and Challenges
For modern experiments, WLCG technology has been fully reused. For traditional experiments that have been running for decades, the challenges include:
- dependency on the POSIX file I/O interface;
- more than 10 PB of legacy data on the IHEP distributed file system, which does not support distributed authentication, so its data servers are not reachable from the edge sites;
- the lack of a distributed data management system;
- users who prefer waiting in local cluster queues to changing their job options from time to time.
Effective, transparent remote I/O on the edge sites therefore requires:
- a mimic file system that provides both a POSIX interface and a data proxy;
- unified user authentication and mapping;
- a read cache and write buffer to hide the latency and bandwidth limits of the WAN;
- asynchronous data write-back to the IHEP central file system.
Useful I/O Patterns and Environment Conditions
I/O patterns that can be leveraged:
- Most HEP data is Write Once Read Many, so a read cache for public input datasets such as random triggers will help performance.
- For background jobs, especially production jobs, the write mode is N-to-N: it is very rare for more than one job to open the same file for modification simultaneously, so the consistency requirements for write operations can be loose.
Environment conditions that can be leveraged:
- Most edge sites have a shared file system accessible from the computing nodes, providing site-wide consistency.
- There is a good network connection between each edge site and IHEP: latency ~40 ms, bandwidth between 1 and 10 Gb/s.
Overall Architecture
- XRootD cluster at IHEP: exports the IHEP Lustre file system to the remote sites; performs authentication and permission checks.
- XCache cluster on the edge site: accelerates read-only data access (a configuration sketch follows after this list).
- Cache file system on the computing nodes: writes data to the local distributed file system and reads data through the XCache cluster.
- Shared file system on the remote site: provides shared disk storage for XCache and the cache file systems on the computing nodes, giving site-wide data consistency.
- Authentication domain of One Platform, Multi-Centers, spanning the components above.
- Data transferring system: uploads new files asynchronously to the IHEP XRootD cluster.
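The slides do not reproduce the XCache configuration; as an illustration only, a minimal edge-site XCache setup in standard XRootD configuration syntax might look like the sketch below. The exported path, host name, port, and local paths are placeholders, not the production values at IHEP or the edge sites.

    # Minimal XCache configuration sketch (placeholder values, illustration only)
    all.export /besfs                          # namespace exported to edge-site clients
    ofs.osslib   libXrdPss.so                  # proxy storage: forward cache misses to the origin
    pss.cachelib libXrdPfc.so                  # enable the proxy file cache (XCache) plugin
    pss.origin   xrootd.ihep.example.cn:1094   # IHEP XRootD cluster exporting Lustre
    oss.localroot /sharedfs/xcache             # cached data kept on the edge site's shared file system
    pfc.diskusage 0.90 0.95                    # purge cached data between these low/high watermarks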
The Cache File System
The cache file system ties all of the components together. Since we reuse members of the XRootD family, it is developed on top of xrootdfs, the FUSE file system shipped with XRootD. Additional developments are needed:
- token-based authentication and communication;
- multiple I/O routes, with the route switched according to the open flags;
- metadata merging for directory operations;
- insertion of a request into the data transferring system on file close (see the write-back sketch below).
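The slides do not detail the data transferring system; conceptually, the write-back on file close can be thought of as a queue of upload requests consumed by a worker on the edge site. The following shell sketch is only an assumption of what such a worker might look like; the queue file, path prefixes, and host name are invented for illustration.

    # Hypothetical write-back worker of the data transferring system (illustration only).
    QUEUE=/sharedfs/transfer/queue           # cachefs appends one local path per closed file
    ORIGIN=root://xrootd.ihep.example.cn     # IHEP XRootD cluster in front of Lustre

    while read -r localpath; do
        # Map the edge-site cache path back to its IHEP namespace path
        remotepath="$ORIGIN/${localpath#/sharedfs/cachefs}"
        # Upload the new file asynchronously to the IHEP central file system
        xrdcp --force "$localpath" "$remotepath"
    done < "$QUEUE"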
Authentication and Authorization
- A user identity token is attached to the job script by the d-HTC scheduler.
- Jobs on the edge sites run under a random local uid/gid; the token is stored at a standard location on the computing node.
- Files in cachefs are owned by this random local uid/gid; cachefs extracts the token information and attaches it to the URL of the XRootD request (see the example below).
- The multiuser plugin on the XRootD server at IHEP parses the token, authenticates with the IHEP Kerberos server, obtains the real uid/gid, and performs I/O on Lustre under that identity.
- Assumption: all users have a local user account at IHEP.
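The token format and the exact request syntax are not shown in the slides. As one illustration of "the token is attached to the URL of the XRootD request", XRootD allows opaque authorization data to be passed in the authz CGI element of the URL; a hypothetical example, with made-up token location, host, and file path, might look like:

    # Hypothetical illustration only: token location, host, and path are placeholders.
    TOKEN=$(cat /var/run/job_token)   # the "standard location" written by the d-HTC scheduler
    xrdcp "root://xrootd.ihep.example.cn//besfs/user/input.dst?authz=${TOKEN}" /tmp/input.dst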
Proof of Concept
Small modifications of xrootdfs were made for the proof of concept:
- read-only operations are sent to the XCache at the edge site, all others to the IHEP XRootD cluster;
- since the original file path is mimicked by xrootdfs, no change to job option files is needed;
- no token-related functionality is implemented yet.
Tests on the CSNS site (shared file system: Huawei OceanStor 9000):
- sequential and stride reads (a simple dd-based read sketch follows below);
- sequential writes;
- BESIII simulation and reconstruction jobs.
# xrootdfs -o readurl=xcache.edgesite -o writeurl=xrootd.ihep mountpoint
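The read benchmark commands are not listed in the slides; a sequential read through the cachefs mount point could be driven with something as simple as the following sketch (file name and mount point are placeholders):

    # Hypothetical sequential-read test through the cachefs mount point (illustration only).
    # The first pass populates the edge-site XCache; the second pass should be served from it.
    dd if=/mountpoint/besfs/random_trigger.raw of=/dev/null bs=1M
    dd if=/mountpoint/besfs/random_trigger.raw of=/dev/null bs=1M   # second, cached read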
Test Results: Sequential Read
- With XCache, the throughput of a second sequential read catches up with direct IHEP Lustre access.
- xrootdfs introduces a significant performance overhead, roughly 2x.
Test Results: Stride Read
- The performance gain of XCache for the second read is more significant.
- The overhead of xrootdfs is also more significant, more than 10x.
Test Results: Sequential Write
- Command 1: dd if=/dev/zero of=testfile bs=128K count=8000
- Command 2: xrdcp testfile root://ihepxrootd//testfile
- In this extreme case, the performance loss introduced by the WAN is about 10x, and the loss introduced by xrootdfs is about 20x.
- Performance optimization through a local write buffer is necessary before production use.
Test Results: BESIII Jobs
- Simulation job: generates 50K J/psi events; input: libraries in CVMFS; output: one 300 MB binary file plus some auxiliary files.
- Reconstruction job: input: the simulation output, libraries in CVMFS, and a 2 GB random trigger file; output: a single 800 MB DST file plus some auxiliary files.
- All jobs were executed on a single test node and finished successfully.
- There is little performance difference between local and remote jobs, since the I/O requirement of a single job is small; a larger test scale is needed to reflect the real scenario.
Summary & Roadmap We proposed a solution based on XRootD to transparent POSIX file data access on edge sites to IHEP Initial test results shown the feasibility and necessity of our design Several functional and performance related developments and iterations are on the road map I/O route selection & merge Data Token Persistent Write Buffer Stress test transferring System Authentication The fuse module of EOS is a good reference our project emphasis on remote background job rather than local interactive access on log in node 15 ISGC 2022
Questions & Thanks ISGC 2022 16