
Evaluating Kubernetes as an Orchestrator for the ATLAS Experiment Event Filter Farm
"Explore the use of Kubernetes as an orchestrator for the Event Filter Farm of the ATLAS Experiment at the LHC. Learn about the functionalities, features, and benefits of using Kubernetes in this context. Find insights from the evaluation process and scaling tests conducted. Presented at CHEP 2018. #ATLASExperiment #KubernetesOrchestrator #EventFilterFarm #LHCTriggers"
Presentation Transcript
Evaluating Kubernetes as an Orchestrator of the Event Filter Farm of the Trigger and Data Acquisition System of the ATLAS Experiment at the LHC
Giuseppe Avolio (CERN), Mattia Cadeddu (CERN, on leave), Reiner Hauser (Michigan State University)
CHEP 2018, Sofia, 09/07/2018
Outline
- The ATLAS Trigger and Data Acquisition (TDAQ) system for the High-Luminosity LHC (HL-LHC) era
- Why Kubernetes?
- Kubernetes functionality and features
- Evaluating Kubernetes as an orchestrator for the Event Filter (EF) farm
- Running EF processes in Docker containers
- Scaling tests
- Conclusions
Roadmap to HL-LHC
- The LHC is now in the last year of Run 2 operations:
  - Peak luminosity: 2 × 10^34 cm^-2 s^-1
  - More than 60 interactions per bunch crossing
- The HL-LHC will push the limit much higher:
  - Luminosity up to 7 × 10^34 cm^-2 s^-1
  - More than 200 interactions per bunch crossing
- The data acquisition system has to cope with the higher luminosity
The ATLAS TDAQ System for HL-LHC
- The system has to sustain high rates:
  - Input data rate of 1 MHz, 10 times more than in Run 2
  - Event size of about 5 MB, 4 times larger than in Run 2
  - See plenary S4: ATLAS and CMS Trigger and Data Acquisition Upgrades for the High Luminosity LHC
- Highly distributed system:
  - Tens of thousands of applications to supervise
  - Large IT infrastructure: the Event Filter farm alone will consist of more than 3000 computing nodes
- The Storage Handler buffers data received from the read-out system to decouple the read-out and the Event Filter (more than one hour of event buffering)
Why an Orchestrator?
Operating the EF farm:
- The presence of the Storage Handler allows the EF farm to be operated in different ways:
  - Decoupled or not from the LHC cycles
  - Prompt or delayed processing
  - Mixed workloads (e.g., Monte Carlo production)
- A robust and reliable mechanism for the management of all processes running in the EF farm is a requirement to guarantee stable and effective execution of the EF service
What Kubernetes offers:
- Support for different application life-cycles
- Flexible scheduling and easy scaling of applications
- Dynamic handling of cluster resources
- Scaling to thousands of hosts
- Support for several storage back-ends
- Containerized applications
Kubernetes
- An open-source system for automating deployment, scaling, and management of containerized applications
- Organizational primitives covering storage, scaling, orchestration, scheduling, upgrades, health checking, and service discovery (a minimal sketch of the declarative model is shown below)
- https://kubernetes.io/
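The following is a minimal sketch of the declarative model referred to above, assuming a working cluster reachable through kubectl; the Deployment name, label, and replica count are illustrative only, and the API group shown matches the Kubernetes 1.5 release used in the tests (current releases expose Deployments under apps/v1):

```bash
# Minimal sketch: declare a Deployment of the Google "pause" container and let
# Kubernetes schedule the requested number of replicas across the cluster.
kubectl apply -f - <<'EOF'
apiVersion: extensions/v1beta1       # apps/v1 on current Kubernetes releases
kind: Deployment
metadata:
  name: ef-pause                     # illustrative name
spec:
  replicas: 240                      # desired number of container instances
  template:
    metadata:
      labels:
        app: ef-pause
    spec:
      containers:
      - name: pause
        image: gcr.io/google_containers/pause-amd64:3.0   # lightweight "do nothing" container
EOF

# Scaling is a one-line change of the desired state:
kubectl scale deployment/ef-pause --replicas=1200
```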
Evaluating Kubernetes
Configuration:
- Kubernetes version 1.5
- CERN IT virtual infrastructure
- Cluster with 1000 virtual cores:
  - 1 master node: 32 cores, 60 GB RAM
  - 240 slave nodes: 4 cores, 8 GB RAM each
Performed tests:
- Execute EF processing units in containers with Kubernetes
- Measure the time needed to fully populate the cluster for different cluster sizes and different numbers of per-host instances of the same container (a sketch of the measurement is shown below)
- Study the impact of the Kubernetes QPS (Queries Per Second) parameter set
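A possible way to carry out the "time to fully populate the cluster" measurement is sketched below, assuming a Deployment such as the ef-pause example above already exists; the target count and the one-second polling interval are arbitrary choices for illustration:

```bash
# Sketch of the timing measurement: scale the deployment to the target number
# of pods and poll until all of them report the Running state.
TARGET=1200                          # e.g. 240 nodes x 5 replicas per host
start=$(date +%s)
kubectl scale deployment/ef-pause --replicas="$TARGET"
while true; do
  running=$(kubectl get pods -l app=ef-pause --no-headers | grep -c ' Running ')
  [ "$running" -ge "$TARGET" ] && break
  sleep 1
done
echo "Cluster populated in $(( $(date +%s) - start )) s"
```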
The QPS Parameter Set
- Several Kubernetes components expose configurable QPS parameters, mainly governing their interaction with the API server:
  - kubelet: event-qps (default 5), event-burst (10), kube-api-qps (5), kube-api-burst (10)
  - kube-controller-manager: kube-api-qps (20), kube-api-burst (30)
  - kube-proxy: kube-api-qps (5)
  - kube-scheduler: kube-api-qps (50), kube-api-burst (100)
- QPS tuning is not really documented:
  - Some sparse information from a few sources available on the web
  - Digging into the command-line parameters of the Kubernetes components
- Approach: scale the default values with fixed multipliers (an example is sketched below)
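As an illustration of the fixed-multiplier approach, the snippet below sketches a x4 scaling of the defaults listed above, passed as command-line flags to the individual components; the ellipses stand for each component's other flags, and how the flags are actually injected (systemd units, static pod manifests, ...) depends on how the cluster is deployed:

```bash
# Example of a x4 scaling of the QPS-related defaults.

# kubelet (defaults: event-qps 5, event-burst 10, kube-api-qps 5, kube-api-burst 10)
kubelet ... --event-qps=20 --event-burst=40 --kube-api-qps=20 --kube-api-burst=40

# kube-controller-manager (defaults: kube-api-qps 20, kube-api-burst 30)
kube-controller-manager ... --kube-api-qps=80 --kube-api-burst=120

# kube-proxy (default: kube-api-qps 5)
kube-proxy ... --kube-api-qps=20

# kube-scheduler (defaults: kube-api-qps 50, kube-api-burst 100)
kube-scheduler ... --kube-api-qps=200 --kube-api-burst=400
```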
EF Processing Units in Kubernetes
- Emulating EF processing units with the offline version of today's filtering software (AthenaHLT)
- Docker container:
  - Base Scientific Linux CERN 6 (SLC6) OS image with a few additional packages installed
  - Software retrieved from the CERN VM File System (CVMFS) repository
- Mounts in containers through the Kubernetes volume abstraction:
  - Storage volume technology abstracted via a FlexVolume driver developed at CERN
- Simulating data processing:
  - Input storage area with data files
  - Output storage area with processing results
(a sketch of a possible pod specification is shown below)
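A hedged sketch of what a pod specification along these lines could look like: the image name, the wrapper command, the FlexVolume driver identifier, and the host paths are assumptions made for illustration, not the actual configuration used in the tests:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: ef-processing-unit           # illustrative name
spec:
  containers:
  - name: athenahlt
    image: slc6-athenahlt:latest     # hypothetical SLC6-based image with the few extra packages
    command: ["/run-athenahlt.sh"]   # hypothetical wrapper: read input files, write results
    volumeMounts:
    - name: cvmfs                    # filtering software distributed via CVMFS
      mountPath: /cvmfs
    - name: input-data               # input storage area with data files
      mountPath: /data/in
    - name: output-data              # output storage area with processing results
      mountPath: /data/out
  volumes:
  - name: cvmfs
    flexVolume:
      driver: "cern/cvmfs"           # assumed identifier of the CERN-developed FlexVolume driver
  - name: input-data
    hostPath:
      path: /srv/ef/input            # placeholder paths for the simulated storage areas
  - name: output-data
    hostPath:
      path: /srv/ef/output
EOF
```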
Kubernetes Tests: Time to Fully Populate the Cluster
[Plot: time (s) to fully populate the cluster vs. cluster size (0-300 nodes), using the Google pause container; container replicas: 5 per host; QPS multipliers x1 to x4]
- Extrapolating the data, it takes almost 9 minutes to populate a 2000-host cluster (the size of the Run 2 farm)
- The current system in ATLAS takes O(10) seconds
Kubernetes Tests: Time Profile of Started Containers
[Plot: number of started applications vs. time (s); container replicas: 5 per host; cluster size: 240; QPS multipliers x1 to x4; higher QPS values give a faster start-up]
- A few outliers beyond the 95th percentile
- No dependency on specific hosts
Kubernetes Tests: Sustained Container Deployment Rate
[Plot: deployment rate (s^-1) vs. QPS multiplier; container replicas: 5 per host; cluster size: 240]
- The sustained rate goes from 20 containers per second (QPS x1) to 70 containers per second (QPS x4)
Kubernetes Tests: Number of Replicas per Host
[Plot: time (s) to fully populate the cluster vs. cluster size (0-300 nodes), using the Google pause container; 1, 3, and 5 replicas per host; QPS x4]
- Kubernetes seems to prefer larger clusters with fewer applications per host
- Extrapolating the obtained results to the Phase-II scenario (and excluding higher-order effects for larger clusters; Kubernetes officially supports clusters of up to 5000 hosts), the EF cluster (3000 hosts) would be fully populated with one processing-unit instance on each host in about 35 seconds
Conclusions & Outlook Kubernetes provides a reach feature set Easy scaling Flexible scheduling Native support to several storage back-ends and sufficient performances to be used as an orchestrator of the EF computing farm Fully populating a 3000 host cluster in about 35 seconds Performance is highly dominated by the QPS parameter set tuning Several parameters in various Kubernetes modules (kubelet, controller manager, proxy, scheduler) Reached a sustained container deployment rate much higher than the out-of-the-box configuration From 20 to 70 containers per second for QPS values four times bigger than the default configuration Keep monitoring upcoming Kubernetes releases in order to track and verify evolving performance figures and new introduced features 14 09/07/2018 CHEP 2018 - SOFIA