Stability and Challenges in IT Infrastructure - Oslo Site Report 2024

Overview of the IT infrastructure at the Oslo site in 2024: stable compute operations, budget constraints, compute node issues, disk and tape storage developments, and network upgrades.

  • Stability
  • Challenges
  • IT Infrastructure
  • Oslo Site
  • Compute Nodes

Presentation Transcript


  1. Oslo site report AHM 2024-1 - Copenhagen

  2. Compute: all good and stable (big picture)
     • Same setup as since 2020. The budget for this period has been used up, as planned.
     • We thought we had a service agreement on the compute nodes until end of 2026, but it turns out we do not. Either there was some miscommunication (although I had it confirmed several times), or I do not know what happened. In any case, we hope the hardware lasts until the end of the period.
     • Plans: recreate the cluster before summer, going from CentOS 7 to AlmaLinux 9.

  3. Compute: been a somewhat bumpy ride since around November.
     [Slide figure: SLURM exporter graph, November through April; red = nodes down, yellow = nodes drained.]

  4. Various issues causing low delivery
     • Had a period with lots of iowait issues causing nodes to hang and go down in SLURM.
     • Some mitigations were done on the NREC side related to IOPS limit enforcement, to prevent other IOPS-hungry processes from clogging the system.
     • Also had my SLURM healthcheck script running for a while; I think that worsened the situation, so it is off again.
     • Finally noticed that some nodes are drained and stay drained in SLURM - not really sure why they do not get new jobs. Resumed them now (see the sketch after this slide).
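
  A minimal sketch (not from the report) of how drained nodes can be listed and resumed by shelling out to the standard SLURM tools. It assumes sinfo and scontrol are on PATH and that the caller has SLURM administrator rights; the class name and the drain-state filter are illustrative.

      import java.io.BufferedReader;
      import java.io.InputStreamReader;

      // Sketch: resume SLURM nodes that are stuck in the "drain" state.
      public class ResumeDrainedNodes {
          public static void main(String[] args) throws Exception {
              // "sinfo -h -t drain -o %N" prints the drained node lists, without a header.
              Process sinfo = new ProcessBuilder("sinfo", "-h", "-t", "drain", "-o", "%N").start();
              try (BufferedReader r = new BufferedReader(new InputStreamReader(sinfo.getInputStream()))) {
                  String nodes;
                  while ((nodes = r.readLine()) != null) {
                      if (nodes.isBlank()) continue;
                      // RESUME clears the drain flag so the nodes can take new jobs again.
                      new ProcessBuilder("scontrol", "update", "NodeName=" + nodes, "State=RESUME")
                              .inheritIO().start().waitFor();
                  }
              }
              sinfo.waitFor();
          }
      }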

  5. [Slide figures: panels titled "One year" and "One month".]

  6. Disk
     • The new storage has been a disaster since installation. Originally we believed the problem was the storage itself; we have since learned it is more likely the NICs.
     • We have started developing dCache on-site at UiO and have a face-to-face planned with Tigran 8-May to 10-May, where he will help us get an on-device development environment up so that we can run non-destructive tests and profiling of new code.
     • In early testing we have substantially improved dCache write performance by eliminating the idiotic 8 KB-transferred-per-system-call limit imposed by Java (illustrated after this slide).
     • Between Tigran and other DESY resources, we expect to get good enough performance for production soon.
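
  For illustration only (this is not dCache code): several JDK convenience copy paths, such as InputStream.transferTo() and the default BufferedOutputStream buffer, move data in 8192-byte chunks, so each write() system call carries at most 8 KB. The sketch below copies through a larger, explicitly sized buffer instead; the 1 MiB size and the class name are assumptions for the example, not the values used in dCache.

      import java.io.FileInputStream;
      import java.io.FileOutputStream;
      import java.io.IOException;
      import java.io.InputStream;
      import java.io.OutputStream;

      // Sketch: copy a stream through a large explicit buffer so each write()
      // hands the kernel up to 1 MiB instead of the JDK's default 8 KB chunk.
      public class CopyBufferDemo {
          private static final int BIG_BUFFER = 1 << 20;  // hypothetical 1 MiB buffer

          static void copy(InputStream in, OutputStream out) throws IOException {
              byte[] buf = new byte[BIG_BUFFER];
              int n;
              while ((n = in.read(buf)) > 0) {
                  out.write(buf, 0, n);  // write the whole chunk at once, not 8 KB at a time
              }
          }

          public static void main(String[] args) throws IOException {
              // FileOutputStream is unbuffered, so the full chunk is passed straight to write().
              try (InputStream in = new FileInputStream(args[0]);
                   OutputStream out = new FileOutputStream(args[1])) {
                  copy(in, out);
              }
          }
      }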

  7. Tape
     • WE HAVE TAPE!!! YES WE DO!!! 11 PB (7 PB OF LOVELY BRAND NEW TAPE!)
     • It looks like we're finally not the bad guys. The tape is filling and we're happy!
     • Visiting a trade show in Berlin 7-May, where I'll look at the Huawei 100-layer, 200 TB Blu-ray disc system as a possible alternative supplier for archive storage.

  8. Network
     • Things are moving slowly on NAIC, but they are moving.
     • Currently we have 100 Gb networking in the CERN rack.
     • Adding redundancy (in the rack, not Uninett), and looking to upgrade to 400 Gb.
