Exploiting Data Distribution in Distributed Learning of Deep Classification Models

Deep learning is widely used across applications because of its high classification accuracy, but training complex models is time-consuming. Distributed learning architectures accelerate training: all-reduce setups suffer from synchronization overheads, while the parameter server architecture avoids them at the cost of stale gradients that can harm accuracy. This PhD research investigates how systematically assigning training data to workers can make training benefit from the data distribution. Neural networks, gradient descent variants, and the parameter server architecture form the background of the study, which aims to improve training speed and model accuracy through algorithmic batch extraction and mini-batch selection.

  • Deep Learning
  • Distributed Learning
  • Neural Networks
  • Parameter Server Architecture
  • Optimization


Presentation Transcript


  1. Exploiting data distribution in distributed learning of deep classification models under the parameter server architecture. VLDB 2021 PhD Workshop. Nikodimos Provatas, supervised by Prof. Nectarios Koziris.

  2. Introduction
     • Deep learning: widely used in a variety of applications; higher classification accuracy; more data leads to more complex model architectures; numerous hours to train on CPU machines; need to accelerate training (GPUs, distributed setups).
     • Distributed learning architectures: all-reduce, where synchronization overheads harm training speed; parameter server architecture, with no synchronization overheads but stale gradients that harm accuracy.
     • In this PhD research: could training benefit from the training data distribution? Single-node and parameter server settings; systematic data assignment to workers; training access patterns.

  3. Background (I): Neural Networks and Optimization
     • Optimization in ML: the weights are the solution of an optimization problem; a loss function is minimized with gradient descent (GD).
     • Gradient descent variants: full-batch GD has slow iterations but views the whole data distribution per iteration; mini-batch stochastic gradient descent (SGD) uses a subset of the data per iteration, needs more iterations to converge, and may get stuck in local minima; adaptive variants include Adam, AdaGrad, etc.
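The following is a minimal NumPy sketch of the mini-batch SGD loop that the slide contrasts with full-batch GD. The `loss_grad` callback, learning rate, batch size, and epoch count are illustrative assumptions, not details from the presentation.

```python
import numpy as np

def minibatch_sgd(X, y, loss_grad, w0, lr=0.01, batch_size=64, epochs=10, seed=0):
    """Plain mini-batch SGD: each step uses a random subset of the data
    instead of the full training set (as full-batch GD would)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                # shuffle to decorrelate batches
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            g = loss_grad(w, X[idx], y[idx])      # gradient on the mini-batch only
            w -= lr * g                           # step against the gradient
    return w
```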

  4. Background (II): Parameter Server Architecture
     • Parameter servers store the neural network in a distributed fashion.
     • Workers keep a local model copy, are assigned a data shard, and compute gradients with SGD variants.
     • Aggregation on the global model is usually asynchronous.
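To make the asynchronous pattern concrete, below is a toy in-process sketch (not the presenter's implementation): a server object holds the global weights, and each worker repeatedly pulls a possibly stale snapshot, computes a gradient on a mini-batch from its own shard, and pushes it back without any global barrier. The `grad_fn` callback and the thread-based setup are assumptions for illustration; real systems such as TensorFlow's parameter server training run servers and workers as separate processes.

```python
import threading
import numpy as np

class ParameterServer:
    """Toy in-process parameter server: holds the global weights and applies
    gradients as they arrive, without waiting for all workers (asynchronous)."""
    def __init__(self, dim, lr=0.01):
        self.w = np.zeros(dim)
        self.lr = lr
        self._lock = threading.Lock()

    def pull(self):
        with self._lock:
            return self.w.copy()          # worker gets a (possibly stale) snapshot

    def push(self, grad):
        with self._lock:
            self.w -= self.lr * grad      # apply immediately; no global barrier

def worker(ps, shard_X, shard_y, grad_fn, steps, batch_size=32):
    """Each worker trains on its own data shard with mini-batch SGD,
    pulling the latest global weights before every step."""
    rng = np.random.default_rng()
    for _ in range(steps):
        w = ps.pull()
        idx = rng.choice(len(shard_X), size=min(batch_size, len(shard_X)), replace=False)
        ps.push(grad_fn(w, shard_X[idx], shard_y[idx]))
```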

  5. Exploiting Data Distribution: Single-Node Training
     • Mini-batch SGD: restricted view of the data per iteration; does not move directly towards the optimal point; shuffling yields less correlated data.
     • Algorithmic batch extraction: make each mini-batch representative of the data; faster learning / higher accuracy? Requires data preprocessing.
     • Related work: Bengio's curriculum learning achieves better convergence based on data traits.
     • PhD research goal: study algorithmic mini-batch selection that is representative of the data distribution.
     • Example on CIFAR-10: cluster each class into 2 subclasses and draw every mini-batch from all subclasses; 5% / 2% improvement in validation loss / error.
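A rough sketch of the CIFAR-10 idea, assuming scikit-learn's KMeans on flattened feature vectors; the actual clustering features and batch composition used in the research are not given on the slide.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_strata(X, y, subclusters=2, seed=0):
    """Split every class into `subclusters` k-means subclasses (strata),
    mirroring the slide's CIFAR-10 setup of 2 subclasses per class.
    X is expected as a 2-D array of flattened feature vectors."""
    strata = []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        labels = KMeans(n_clusters=subclusters, random_state=seed,
                        n_init=10).fit_predict(X[idx])
        for s in range(subclusters):
            strata.append(idx[labels == s])
    return strata  # list of index arrays, one per (class, subclass) stratum

def representative_minibatch(strata, batch_size, rng):
    """Draw roughly equal numbers of samples from every stratum so each
    mini-batch reflects the full data distribution."""
    per_stratum = max(1, batch_size // len(strata))
    picks = [rng.choice(s, size=min(per_stratum, len(s)), replace=False)
             for s in strata]
    return np.concatenate(picks)
```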

  6. Exploiting Data Distribution: Parameter Server Training
     • Stale gradients effect: gradients are computed on old model parameters. Possible solutions: variable learning rate (Dutta, 2013); learning rate techniques for heterogeneous environments (Jiang, 2017); FlexPS staleness parameter aging (Huang, 2018).
     • Better data assignment: solutions so far are algorithmic, at the parameter server / learning rate level; systematic data sharding is so far random (TensorFlow uses mod); class-level data skew may occur; unbalanced classes (ImageNet example).
     • Idea: exploit stratification and hidden stratification in data sharding and per-worker mini-batch selection to smooth the staleness effect.
     • Related work on stratification: approximate query processing; learning from heterogeneous databases; medical image classification; locality / data-sensitive hashing.
     • [Figure: ImageNet data distribution]
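One simple way to realize the stratified-sharding idea is sketched below, using plain class labels as strata (the slide also proposes hidden stratification, i.e. strata discovered by clustering): each class is shuffled and dealt round-robin to workers, so no shard ends up class-skewed, unlike purely random or mod-based assignment. This is an illustration, not the thesis' sharding algorithm.

```python
import numpy as np

def stratified_shards(y, num_workers, seed=0):
    """Assign sample indices to workers so every shard keeps (roughly) the
    global class proportions, instead of mod/random sharding that can leave
    some workers with a skewed class distribution."""
    rng = np.random.default_rng(seed)
    shards = [[] for _ in range(num_workers)]
    for c in np.unique(y):
        idx = rng.permutation(np.where(y == c)[0])   # shuffle within the class
        for i, sample in enumerate(idx):
            shards[i % num_workers].append(sample)   # deal round-robin per class
    return [np.array(s) for s in shards]
```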

  7. Research Plan
     01. Measure the effects of mini-batch design in single-node training: representative mini-batches; exploit real-time metrics.
     02. Evaluate techniques to extract distribution traits from big data sets: clustering algorithms; data-sensitive hashing.
     03. Apply distribution-related information in data sharding: stratification; distribution traits.
     04. Build a streaming system for serving mini-batches to workers: collect data information; efficiently prepare mini-batches.
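Step 02 mentions data-sensitive hashing as a cheaper alternative to clustering for extracting distribution traits. A minimal sign-random-projection LSH sketch is shown below; the hash width and the way buckets would feed into sharding or mini-batch selection are assumptions for illustration.

```python
import numpy as np

def lsh_buckets(X, num_bits=8, seed=0):
    """Random-projection LSH: hash each sample to a short binary code so that
    similar samples tend to share a bucket. Buckets can then serve as cheap
    distribution 'traits' on large data sets, instead of full clustering."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], num_bits))   # random hyperplanes
    bits = (X @ planes > 0).astype(np.uint8)               # one bit per hyperplane
    codes = bits @ (1 << np.arange(num_bits))              # pack bits into an integer code
    buckets = {}
    for i, code in enumerate(codes):
        buckets.setdefault(int(code), []).append(i)
    return buckets   # bucket id -> list of sample indices
```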

  8. Technologies and Benchmarking Setup
     • Apache Spark: clustering algorithms and metrics computation with SparkML.
     • Google TensorFlow: model training under the parameter server architecture.
     • Apache Arrow: streaming data batches; data optimized in columnar format; CPU / GPU computations; TensorFlow compatibility.
     • Benchmarking setup: measurements on training speed and training / validation metrics; baseline is training state-of-the-art models under the parameter server setup; compare with models exploiting distribution traits.
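As an illustration of how the SparkML side could produce the (class, subclass) strata consumed by the training side, here is a hedged PySpark sketch; the input schema (an `id` column, a `label` column, feature columns prefixed `f_`) and the HDFS paths are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Hypothetical input: one row per sample, with an id, a class label,
# and precomputed feature columns named f_0, f_1, ...
spark = SparkSession.builder.appName("distribution-traits").getOrCreate()
df = spark.read.parquet("hdfs:///data/cifar10_features.parquet")   # assumed path

feature_cols = [c for c in df.columns if c.startswith("f_")]       # assumed naming
vectorized = VectorAssembler(inputCols=feature_cols,
                             outputCol="features").transform(df)

# Cluster each class into two subclasses, as in the single-node experiment.
parts = []
for label in [r["label"] for r in vectorized.select("label").distinct().collect()]:
    class_df = vectorized.filter(F.col("label") == label)
    km = KMeans(k=2, featuresCol="features", predictionCol="subclass", seed=0)
    parts.append(km.fit(class_df).transform(class_df))

with_strata = parts[0]
for p in parts[1:]:
    with_strata = with_strata.unionByName(p)

# Persist (class, subclass) assignments for the sharding / mini-batch stages.
with_strata.select("id", "label", "subclass").write.mode("overwrite").parquet(
    "hdfs:///data/cifar10_strata.parquet")
```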

  9. Thank you. Questions? Contact information: Nikodimos Provatas, email: nprov@cslab.ece.ntua.gr
