
JUNO Computing Status at IHEP
Yujiang Bi, on behalf of IHEP-CC
Outline
- Overview: computing, storage, network, public services
- EOS evaluation: testbed, QuarkDB, performance, EOS SE
- EOSCTA progress: architecture, testbed, problems
- Roadmap
- Summary
Computing Resources: HTCondor Cluster
- JUNO job statistics, 2020.07-2021.01: 8.9M jobs, 10.7M CPU hours.
- 40%+ extra CPU hours came from other experiments, a benefit of the computing resource sharing policy.
- Customized job walltime limits for JUNO; query them with hep_clus -g juno --walltime (see the sketch below).
- 2020 summer maintenance: all login/worker nodes upgraded to CentOS 7; the old systems (SL 5/6/7) are still provided as containers (see the user manual).
- The scheduler server was expanded to relieve the pressure brought by other experiments.
- 448 CPU cores added in 2021/01.

  Job type    Walltime limit    Resource limit
  Test        < 5 min           -
  Short       < 0.5 h           -
  Default     < 20 h            -
  Mid         < 100 h           < 10% of total resources
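For users picking a walltime class, the query command below is quoted from the slide; the submission line is only a sketch of typical usage of IHEP's hep_sub wrapper, whose flag names are an assumption here and should be checked against the user manual.

```bash
# Quoted from the slide: list the JUNO walltime classes
hep_clus -g juno --walltime

# Hypothetical submission to the "mid" class via IHEP's hep_sub wrapper;
# the -wt flag name is an assumption, see the user manual for exact syntax
hep_sub -g juno -wt mid myjob.sh
```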
Computing Resources: Slurm GPU Cluster
- Slurm upgraded to 19.05.6.
- Shared by 6 groups (mlgpu, lqcd, junogpu, higgs, gpupwa, bldesign), including junogpu for JUNO (a submission sketch follows below).
- 182 NVIDIA V100 cards, 8 GPUs per node.
- 19 active JUNO users, 9 of them new.
- JUNO job statistics, 2020.07-2021.01: 17,986 jobs (11% of all jobs), 53,052 GPU hours (11% of consumed GPU hours).
[Charts: share of jobs and of consumed GPU hours by group; number of JUNO GPU jobs and GPU hours per month, 2020.07-2021.01]
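As a sketch of how a JUNO user might submit to this cluster: the partition name below is inferred from the junogpu group name on the slide and the GRES string from the V100 hardware; both are assumptions, not confirmed site configuration.

```bash
#!/bin/bash
# Minimal Slurm GPU job sketch; partition and gres names are assumptions
#SBATCH --partition=junogpu
#SBATCH --gres=gpu:v100:1
#SBATCH --time=24:00:00
#SBATCH --job-name=juno-gpu-demo

srun python train.py   # hypothetical payload
```

Submitted with `sbatch job.sh`; the queue can be inspected with `squeue`.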
Storage Resources
- Lustre (/junofs): 1.9 PB in total, 1.5 PB used, 59M files; ~600 TB added in Nov. 2020 (see the usage-check sketch below).
- EOS (junoeos): ready for user tests since 2020/08; 980 TB in total, 732 TB used, 1.2M files.
- Castor: 47 TB in total; no new data was backed up to tape in 2020.
  - /castor/ihep.ac.cn/juno/prototype: 29 TB, stored with 2 copies on LTO-7 tapes.
  - /castor/ihep.ac.cn/juno/PmtCharacterization: 18 TB, stored with 1 copy on LTO-4 tapes; planned to migrate to LTO-7 tapes.
[Chart: used vs. available capacity (TB) for /junofs, /eos, /scratchfs, /workfs]
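A quick sketch of how a user could check these numbers with the standard Lustre and EOS client tools; the group name and the MGM URL are assumptions inferred from the slide.

```bash
# Lustre: group usage on /junofs (group name "juno" is an assumption)
lfs quota -g juno /junofs

# EOS: space and quota on the junoeos instance (MGM URL inferred
# from the instance name on the slide)
eos root://junoeos.ihep.ac.cn space ls
eos root://junoeos.ihep.ac.cn quota ls
```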
Data Transfer System
- Modular architecture; high availability is part of the design.
- Deployed in Zhongshan and at IHEP; 31 TB of JUNO-PMT data transferred so far; works well to date.
- Modules:
  - Sentry: finds new data and puts it into the local buffer and DB.
  - Configuration: service and policy interface.
  - Monitoring: GUI showing status and statistics.
  - Logging: real-time and historical logs.
  - Transfer: sender and receiver.
[Diagram: transmission traffic from the DAQ RAID through the transfer servers (master and backup) to Lustre at the IHEP Computing Center]
On-site Facilities
- On-site computer room: completed, acceptance passed, and ready for service.
- Network: plan optimized, and the network service will cover all buildings; equipment under procurement, expected to finish in late Feb. 2021; a temporary 50 Mbps internet link will be ready in late Mar. 2021.
- Generic cabling: campus cabling in progress, ready in late 2021; computer room cabling and labeling ready in early Mar. 2021.
JUNO Storage Plan
- The current main filesystem is Lustre (/junofs).
- EOS works better with XRootD and ROOT, giving seamless integration with the HEP workflow (see the sketch below).
- Is EOS good enough for JUNO? Two questions: EOS I/O system performance, and application performance on EOS. Should it replace Lustre as the main filesystem?
- EOS status at CERN: the main filesystem for experiments and public services; 6 production instances managing 300+ PB of LHC data.
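The "seamless integration" point means ROOT can open files on EOS directly over the XRootD protocol, with no FUSE mount in between; a minimal illustration, where the hostname and file path are hypothetical:

```bash
# Open a ROOT file on EOS directly via XRootD; no /eos mount needed
# (hostname and file path are hypothetical)
root -l -e 'auto f = TFile::Open("root://junoeos.ihep.ac.cn//eos/juno/user/demo.root"); f->ls();'
```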
EOS Status at IHEP
- Current architecture: MGM (management service), FST (file storage service), QuarkDB (metadata service), plus MQ, Sync, and Fed services.
- Deployment status: 5 instances including junoeos, managing 10+ PB of data.
- EOS for LHAASO: the main storage filesystem, running steadily for 4+ years; peak I/O of ~46 GB/s read and ~28 GB/s write; no fuse/fusex (no /eos mountpoint) on worker nodes; a new Lustre (/lhaasofs) holds user scripts, job logs, and non-ROOT data (CORSIKA).
QuarkDB
- The new namespace (NS) backend for EOS, based on RocksDB and the XRootD protocol.
- All EOS instances at CERN have been upgraded to QuarkDB; the old NS is deprecated, and QuarkDB is required since EOS 4.8.28.
- SSDs are preferred for the database storage.
- Advantages:
  - No RAM limitation: metadata is resident on disk and flexible to expand.
  - Near-instant MGM reboot (1-10 s, versus hours for the old NS).
  - Easy to back up the database or restore from a backup.
  - Highly available (thanks to the Raft algorithm) and reliable.
- Architecture: at least 3 nodes, with 1 leader and several followers (see the setup sketch below).
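A sketch of how such a 3-node Raft cluster is typically initialized with the standard QuarkDB tooling; the node names follow the testbed on the next slide, while the paths, cluster ID, and port are assumptions.

```bash
# Initialize one member of a 3-node QuarkDB Raft cluster
# (paths, cluster ID, and port are assumptions; repeat per node)
quarkdb-create --path /var/lib/quarkdb/node-1 \
               --clusterID juno-qdb \
               --nodes qdb1:7777,qdb2:7777,qdb3:7777

# QuarkDB speaks the redis wire protocol, so redis-cli can query
# the Raft state (leader, term, replication status)
redis-cli -p 7777 raft-info
```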
QuarkDB Evaluation & Migration
- Evaluation purpose: performance, stability, backup.
- Setup: 3 QuarkDB nodes (qdb[1-3]), 4 EOS nodes (eostest[01-04]), 1 JBOD with 54 HDDs.
- Scheme: continuously write small files to EOS; observe QuarkDB's status under abnormal conditions.
- Migration for LHAASO/HXMT:
  1. Convert the metadata to QuarkDB.
  2. Configure EOS to use QuarkDB as the NS.
  3. Upgrade EOS to 4.7.7 and restart the service.
- Service status: 5 months without a service restart; stable enough for production.
- Evaluation results:

    Metric                  Old NS        QuarkDB
    Write performance       1K files/s    800 files/s
    Boot time (1.2M files)  16 min        < 1 s

- Abnormal-case test: the cluster keeps working with 2 nodes; QuarkDB leader election is fast, and EOS is not sensitive to leader transitions.
- Backup & restore: raft-checkpoint creates a backup, rsync copies it elsewhere, and replacing the DB with the backup restores it; both directions are easy (see the sketch below).
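A sketch of the backup/restore cycle described on the slide; the raft-checkpoint command and the rsync step are named there, while the port and directory layout are assumptions.

```bash
# Take a consistent on-disk checkpoint of the QuarkDB database
# (port and paths are assumptions)
redis-cli -p 7777 raft-checkpoint /var/lib/quarkdb/backups/2021-01-31

# Ship the backup to another machine
rsync -a /var/lib/quarkdb/backups/2021-01-31 backuphost:/backups/qdb/

# Restore: stop the node, replace its database directory with the
# checkpoint, then restart the service
```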
JUNO EOS
- Availability: ready for user tests since 2020/08.
- Hardware: 4 nodes, 2 disk arrays.
- Software: EOS 4.8.25, QuarkDB 0.4.2.
- Layout: 1 MGM, 4 FST nodes with 84 filesystems in total, 3 QuarkDB nodes.
- Usage: access EOS via the pure XRootD protocol and the eos command (see the sketch below); no fuse/fusex /eos mountpoint on login/worker nodes.
- Monitoring: integrated with the unified Ganglia and Nagios monitoring systems.
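A sketch of the two documented access paths, the eos CLI and plain XRootD; the MGM URL is inferred from the instance name, and the /eos/juno paths are hypothetical.

```bash
# Browse the namespace with the eos CLI (MGM URL inferred from the
# instance name; directory path is hypothetical)
eos root://junoeos.ihep.ac.cn ls -l /eos/juno/

# Copy data in and out over plain XRootD, since there is no /eos FUSE mount
xrdcp results.root root://junoeos.ihep.ac.cn//eos/juno/user/demo/results.root
xrdcp root://junoeos.ihep.ac.cn//eos/juno/user/demo/results.root /tmp/
```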
Problems & Solutions
- Problem: unable to store a file, or the file is incomplete or lost; the client is disconnected from the FST because of a connection timeout.
- Solutions:
  1. Upgrade the client XRootD to a newer version.
  2. Upgrade EOS to 4.8.25.
  3. Add XRD_STREAMTIMEOUT=600 to job scripts that see "Unable to store file - file has been cleaned because of a client disconnect".
  4. EOS > 4.8.31 solves this on the server side by setting EOS_FST_ASYNC_CLOSE=1 in the EOS sysconfig file (see the sketch below).
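Both settings come straight from the slide; the only assumption below is that the sysconfig file sits at the standard /etc/sysconfig/eos location.

```bash
# Client side, in the job script: raise the XRootD stream timeout
export XRD_STREAMTIMEOUT=600

# Server side, with EOS > 4.8.31: enable asynchronous close in the
# EOS sysconfig file (standard location assumed)
echo 'EOS_FST_ASYNC_CLOSE=1' >> /etc/sysconfig/eos
```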
EOS I/O Performance - Application
- Purpose: evaluate whether EOS meets JUNO's offline I/O needs and is comparable to Lustre.
- Test setup, two I/O patterns: direct read from or write into Lustre/EOS, and using a temporary storage area as an intermediate stage.
- Offline software: detector simulation (detsim) and electronics simulation (elecsim).
- Test tool: DataCollSvc; measured job walltime, occupied RAM, and the time to read or write a file; 500 jobs, each repeated 10 times.
- From Sicheng Yuan's report.
EOS I/O Performance - Application (Results)
- Total job walltime: slightly longer on EOS for elecsim.
- Read walltime (detsim data): no obvious difference.
- Write walltime: EOS > tmp > Lustre; note that files are stored with 2 copies in EOS versus 1 in Lustre.
- Copy into EOS/Lustre: EOS is faster for large files, Lustre for small files.
- From Sicheng Yuan's report.
- Summary: EOS can manage heavy I/O scenarios (read ~46 GB/s, write ~28 GB/s); there is no big performance gap between EOS and Lustre; EOS should meet the needs of JUNO computing; more user application performance tests are needed.
EOS SE for JUNO
- Requirements for an EOS storage element (SE) for distributed computing and storage: GSI authentication, GridFTP support, SRM management, DIRAC support.
- Current status: EOS with GridFTP works (GSI auth); eosgrid.ihep.ac.cn served Belle in 2020 (see the transfer sketch below).
- To do: SRM management; integration with DIRAC; TPC (third-party copy) support?
- Plan: ready in the middle of 2021.
- Testbed hardware: SE node junoeosse; EOS nodes junoeos[01-04].
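A sketch of a GSI-authenticated GridFTP transfer with the standard Grid client tools; the host is taken from the slide, while the VO name "juno" and the destination path are assumptions.

```bash
# Obtain a VOMS proxy for GSI authentication (VO name is an assumption)
voms-proxy-init --voms juno

# Push a file to the EOS SE over GridFTP (host from the slide;
# destination path is hypothetical)
globus-url-copy file:///tmp/results.root \
    gsiftp://eosgrid.ihep.ac.cn/eos/juno/grid/results.root
```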
EOS+CTA
- CTA (CERN Tape Archive) is designed as the tape back-end of EOS: EOS provides the user interface, file operations, and namespace; CTA provides highly performant tape operations and management. One CTA can act as the tape back-end for several EOSCTA instances.
- EOS+CTA architecture:
  - Big EOS (or another filesystem): data is transferred from/to the little EOS; one-to-one correspondence with a little EOS.
  - Little EOS: receives data from the big EOS and copies it to tape; retrieves data from tape and transfers it back to the big EOS.
  - CTA: manages the queues for data archival and retrieval (see the usage sketch below).
- EOSCTA at CERN: ready for production; ATLAS, CMS, and ALICE have migrated; LHCb and the public instance migrate in 2021.
[Diagram: archive and recall flows between big EOS instances, their little EOS counterparts, and the central CTA]
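From the user's side, archival and recall on an EOSCTA instance typically go through plain XRootD, with the standard "prepare" request triggering a tape recall; a sketch, where the host name follows the IHEP testbed on the next slide and the path is hypothetical.

```bash
# Archive: copying a file into the little EOS queues it for tape
xrdcp raw.data root://ctaeos01//eos/cta/juno/raw.data

# Recall: an XRootD "prepare" (stage) request asks CTA to bring the
# file back from tape to the EOS disk buffer
xrdfs root://ctaeos01 prepare -s /eos/cta/juno/raw.data
```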
EOS+CTA Testbed at IHEP
- Hardware: 3 VMs (ctaeos[01-03]) and a virtual tape library.
- Software: CTA 3.1-14, EOS 4.8.34, QuarkDB 0.4.3, PostgreSQL 9.6, mhVTL 1.5.3.
- CTA components: the CTA frontend daemon, the tape daemon, the CTA CLI (see the sketch below), the object store, and the database.
- Problems:
  - Authentication: Kerberos is required for cta-frontend.
  - Object store: a file-based object store is used instead of Ceph.
  - Transfer system: only XRootD or the EOS command; no dedicated transfer system yet.
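The CTA CLI mentioned above is cta-admin, which talks to cta-frontend and therefore needs a Kerberos ticket; a sketch, where the principal and realm are hypothetical and exact subcommand flags can vary across CTA versions.

```bash
# cta-admin authenticates via Kerberos (principal/realm hypothetical)
kinit ctaadmin@IHEP.AC.CN

# Inspect the testbed; flags may differ slightly between CTA versions
cta-admin version
cta-admin drive ls
cta-admin tape ls --all
```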
EOS/CTA Roadmap
- EOSCTA:
  - Kerberos authentication for EOS: Feb. 2021.
  - Ceph as the object store: Mar. 2021.
  - Transfer system: FTS service(?).
  - Ready for production: June 2021.
  - Production deployment: to be confirmed.
- Software upgrades:
  - EOS/QuarkDB upgraded to the latest stable versions.
  - Dedicated servers for QuarkDB, independent from the EOS services, with an SSD RAID for the database storage.
  - Ready for production: late 2021; more application tests needed.
- EOS SE: middle of 2021.
Summary
- Computing facilities progressed well in spite of COVID-19:
  - System and software upgrades during the summer maintenance: HTCondor, Slurm, EOS, QuarkDB.
  - On-site campus facility construction is proceeding as planned: the on-site computer room is ready for service; network and cabling for the computer room will be ready in late Mar. 2021.
- EOS / EOS SE ready for production in 2021:
  - QuarkDB is remarkably stable, fit for production; hardware and software upgrades are still needed for QuarkDB.
  - EOS I/O performance for applications is acceptable.
  - The EOS SE for JUNO is deployed and under evaluation.
- EOSCTA aimed at production deployment in 2021: the VM-based testbed is deployed and under evaluation.
Thank You! Questions?
EOSCTA @ CERN
- Ready for production; migration from Castor to EOSCTA has been tested: only the metadata is migrated, the tapes are unchanged. Migration started in 2020.
- PostgreSQL is replacing MySQL, removing the dependence on an Oracle license.
- Data flow: DAQ -> Big EOS -> FTS -> Little EOS -> CTA (see the sketch below).
- Migration progress:
  - ATLAS: Jan-Jun 2020, in production.
  - ALICE: Jul-Oct 2020, in production.
  - CMS: Oct-Dec 2020, in production.
  - LHCb: Q1 2021.
  - Public: 2021.
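The big-EOS to little-EOS hop in this data flow is handled by FTS; a sketch with the standard FTS3 command-line client, where the FTS endpoint and both file URLs are hypothetical.

```bash
# Submit a transfer from the big EOS to the little EOS via FTS3
# (endpoint and file URLs are hypothetical)
fts-transfer-submit -s https://fts3.cern.ch:8446 \
    root://bigeos.cern.ch//eos/experiment/raw/run001.data \
    root://littleeos.cern.ch//eos/ctaexp/raw/run001.data
```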
Monitoring System: OMAT
- OMAT (Open Maintenance and Analysis Toolbox), developed at IHEP.
- Deployment migration: moved from a physical cluster to a k8s cluster; the ES cluster expanded from 7 to 33 nodes; more stable and more reliable; the common query interface is ready.
- Remote site monitoring: provides customized data collection functions; an evaluation strategy for site maintenance quality is being designed.
- Roadmap: analyze the monitoring data of remote sites and provide a site operation and maintenance quality ranking.