Efficient Cluster Analysis for Scientific Payloads in Distributed Computing

Explore cluster analysis of scientific payloads to optimize their processing in distributed computing environments. The presentation hypothesizes that common features of scientific payloads affect clustering-algorithm performance, surveys several payload types, examines WLCG task processing logs and descriptive payload parameters, and compares methods for determining the number of clusters efficiently.

  • Cluster Analysis
  • Scientific Payloads
  • Distributed Computing
  • Data Analysis
  • Performance Optimization

Presentation Transcript


  1. Cluster analysis of scientific payloads to execute them efficiently in a distributed computing environment. Maksim Gubin, Tomsk Polytechnic University; Maria Grigoryeva, National Research Centre Kurchatov Institute; Mikhail Titov, National Research Centre Kurchatov Institute

  2. Hypothesis. Scientific payloads produced by modern scientific experiments have common features that cause clustering algorithms to perform differently on such payloads than on the typical datasets used for evaluating clustering-algorithm performance. Re-evaluating clustering-algorithm performance on a sample of scientific payloads can therefore lead to an improved set of recommendations for using clustering algorithms in data analysis for large scientific experiments.

  3. Payload types under consideration: time series (event logs where each record has a timestamp, an event type, and domain-specific event parameters); internet traffic logs (source, destination, data amount, port, protocol); volatile states (records describing the current state of an entity under consideration); computing-grid payload processing logs (payload type, processing start time, processing duration, CPU load, memory usage, disk I/O).
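
  As a rough illustration only, the grid-processing and traffic payload types could be modeled as the records sketched below; the field names and units are assumptions based on the parameters listed on this slide, not the actual log schemas.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class GridPayloadRecord:
        """One computing-grid payload processing log entry (illustrative fields)."""
        payload_type: str       # e.g. simulation, reconstruction, analysis
        start_time: datetime    # processing start time
        duration_s: float       # processing duration in seconds
        cpu_load: float         # average CPU load during processing
        memory_mb: float        # memory usage
        disk_io_mb: float       # disk I/O volume

    @dataclass
    class TrafficLogRecord:
        """One internet traffic log entry (illustrative fields)."""
        source: str
        destination: str
        data_bytes: int
        port: int
        protocol: str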

  4. Payloads. WLCG task processing logs: hundreds of thousands of records daily; extreme level of detail; domain-specific parameters; hard to meaningfully expand with additional sources. TPU traffic records: time series; over a million records daily; significant possibility of enrichment; minimal level of detail.

  5. Descriptive parameters of payloads. Traffic data: source; destination; data size; user; source location; event at the source location during the data transfer. Scientific payload: site; creator; core count; task type; processing type; task priority; input dataset size; allocated resources.
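
  A minimal scikit-learn sketch of how these descriptive parameters might be split into categorical and numeric features for clustering; the column names are hypothetical and the real WLCG task-log schema may differ.

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Hypothetical column names; the actual task-log schema may differ.
    categorical = ["site", "creator", "task_type", "processing_type"]
    numeric = ["core_count", "task_priority", "input_dataset_size", "allocated_resources"]

    preprocess = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numeric),
    ])
    # X = preprocess.fit_transform(df)   # df: a pandas DataFrame of payload records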

  6. Determining the number of clusters. Elbow method: requires building a set of models with different numbers of clusters; does not always give a decisive answer; can be automated, but at the cost of precision. Hierarchical clustering: requires building a hierarchical model; its estimate can differ significantly from the elbow-method estimate; is faster to compute than the elbow method.
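
  The two approaches can be sketched generically with scikit-learn and SciPy as shown below; this is not the exact procedure used in the study, and the candidate range and distance threshold are placeholders.

    import numpy as np
    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import linkage, fcluster

    def elbow_inertias(X, k_range=range(2, 11)):
        """Fit one K-means model per candidate k; inspect the inertia curve for an 'elbow'."""
        return {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                for k in k_range}

    def hierarchical_estimate(X, distance_threshold):
        """Build a Ward dendrogram once and cut it at a distance threshold to get a cluster count."""
        Z = linkage(X, method="ward")
        labels = fcluster(Z, t=distance_threshold, criterion="distance")
        return int(len(np.unique(labels)))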

  7. Impact of data preprocessing on clustering performance. Encoding categorical features causes a severe drop in clustering performance because it increases the volume of the data: in the test cases, increases in clustering duration of up to 10 times were observed, and on average a model took 2.1 times longer to train when categorical features were included.
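
  A small illustration of where the extra data volume comes from: one-hot encoding a single categorical column with many distinct values multiplies the feature count. The timing figures above are from the slide, not from this sketch.

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # One categorical column with 200 distinct values (e.g. site names).
    sites = np.array([[f"site_{i % 200}"] for i in range(100_000)])
    encoded = OneHotEncoder().fit_transform(sites)   # sparse matrix by default
    print(sites.shape, "->", encoded.shape)          # (100000, 1) -> (100000, 200)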

  8. Impact of data preprocessing on clustering quality. Using categorical features degrades both precision and recall regardless of preprocessing. Logarithmic normalization of duration-type parameters improves both precision and recall in the evaluated test cases. Quantile normalization of duration-type parameters also improves both precision and recall in the evaluated test cases.
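
  The two normalizations of duration-type parameters could be applied with scikit-learn as sketched below; the data here is synthetic and heavy-tailed, standing in for real processing durations.

    import numpy as np
    from sklearn.preprocessing import FunctionTransformer, QuantileTransformer

    # Synthetic heavy-tailed 'durations' standing in for real payload processing times.
    durations = np.random.lognormal(mean=5.0, sigma=1.5, size=(10_000, 1))

    log_scaled = FunctionTransformer(np.log1p).fit_transform(durations)
    quantile_scaled = QuantileTransformer(output_distribution="normal",
                                          n_quantiles=1000).fit_transform(durations)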

  9. Conclusion. Scientific payloads produced by modern scientific experiments have common features that cause clustering algorithms to behave differently on such payloads: the DBSCAN algorithm showed marginally better clustering precision and recall (on the scale of 10%) on the scientific payloads tested than on time series, while the K-means algorithm showed the best performance and scalability, and its precision can be boosted by preprocessing the data before clustering, particularly by using data scaling that reflects the nature of the scaled values (log scaling, quantile scaling, etc.).
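
  A hedged end-to-end sketch of this recommendation: K-means preceded by value-aware scaling (log scaling here), with DBSCAN run on the same scaled data for comparison. The data, cluster count, and DBSCAN parameters are placeholders, not values from the study.

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler
    from sklearn.cluster import KMeans, DBSCAN

    rng = np.random.default_rng(0)
    X = rng.lognormal(mean=3.0, sigma=1.0, size=(5_000, 4))   # stand-in for duration-like features

    kmeans_model = make_pipeline(
        FunctionTransformer(np.log1p),   # scaling that reflects the nature of the values
        StandardScaler(),
        KMeans(n_clusters=5, n_init=10, random_state=0),
    )
    kmeans_labels = kmeans_model.fit_predict(X)

    dbscan_labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(np.log1p(X))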

  10. Thank you for your attention!
