Optimizing Cluster Scheduling for Data Analytics Efficiency

altruistic scheduling in multi altruistic n.w

1 / 46

Embed Share

Explore the latest advancements in altruistic scheduling and cluster efficiency strategies for multi-resource clusters in data analytics. Discover how long-term fairness, job performance, and cluster efficiency are prioritized to enhance scheduling flexibility and optimize completion times for data-parallel jobs. Dive into the complexities of balancing fairness, efficiency, and performance objectives in cluster scheduling for improved analytics outcomes.

nmoli Follow

Uploaded on Jun 03, 2025 | 1 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Altruistic Scheduling in Multi Altruistic Scheduling in Multi- -Resource Clusters Carbyne Resource Clusters Robert Grandl, Mosharaf Chowdhury, Aditya Akella, Ganesh Ananthanarayanan

Performance of Cluster Schedulers We observe that: Existing cluster schedulers focus oninstantaneous fairness Long-term fairness enables larger scheduling flexibility Data-parallel jobs provide ample opportunities for long-term optimizations Carbyne 1.3x higher cluster efficiency; 1.6x lower average job completion time; near-perfect fairness 2

Scheduling in Data Analytics Clusters Cluster Wide Resource Manager Inter-Job Scheduler Intra-Job Scheduler Jobs Cluster resources Objectives: #1: Fairness #3: Cluster Efficiency #2: Job Performance = is difficult! 3

Scheduling in Data Analytics Clusters Makespan (seconds) Jain s Fairness Index Avg. JCT (seconds) 1 8000 1000 6000 0.5 4000 500 2000 0 0 0 Job Performance Cluster efficiency Inter-job fairness Objectives: #1: Fairness #3: Cluster Efficiency #2: Job Performance = is difficult! 4 TPC-DS workload on a 100-machine cluster

Scheduling in Data Analytics Clusters Makespan (seconds) Jain s Fairness Index Avg. JCT (seconds) 1 8000 1224 1000 6000 5478 0.5 4000 500 0.86 2000 0 0 0 Job Performance Cluster efficiency Inter-job fairness Objectives: #1: Fairness DRF [NSDI 11] #3: Cluster Efficiency #2: Job Performance = is difficult! Dominant Resource Fairness Max-min fair sharing across multiple dimensions 4 TPC-DS workload on a 100-machine cluster

Scheduling in Data Analytics Clusters Makespan (seconds) Jain s Fairness Index Avg. JCT (seconds) 1 8000 1224 1000 6210 6000 5478 769 0.5 4000 500 0.86 0.64 2000 0 0 0 Job Performance Cluster efficiency Inter-job fairness Objectives: #1: Fairness DRF [NSDI 11] #3: Cluster Efficiency #2: Job Performance = is difficult! Shortest Job First (SJF) Schedule jobs near completion first 4 TPC-DS workload on a 100-machine cluster

Scheduling in Data Analytics Clusters Makespan (seconds) Jain s Fairness Index Avg. JCT (seconds) 1 8000 1224 1000 1123 6210 6000 5478 769 0.74 0.5 4356 4000 500 0.86 0.64 2000 0 0 0 Job Performance Cluster efficiency Inter-job fairness Objectives: #1: Fairness DRF [NSDI 11] #3: Cluster Efficiency #2: Job Performance = is difficult! Tetris [SIGCOMM 14] Shortest Job First (SJF) Efficiently pack resources 4 TPC-DS workload on a 100-machine cluster

Scheduling in Data Analytics Clusters Makespan (seconds) Jain s Fairness Index Avg. JCT (seconds) 1 8000 1224 1000 1123 6210 6000 5478 769 0.74 0.5 4356 4000 500 0.86 0.64 2000 0 0 0 Job Performance Cluster efficiency Inter-job fairness Objectives: #1: Fairness DRF [NSDI 11] #3: Cluster Efficiency #2: Job Performance = is difficult! Tetris [SIGCOMM 14] Shortest Job First (SJF) Outperforms Preferred metric Underperforms Secondary metrics 4 TPC-DS workload on a 100-machine cluster

Scheduling in Data Analytics Clusters Makespan (seconds) Jain s Fairness Index Avg. JCT (seconds) 1 8000 1224 1000 1123 6210 6000 5478 769 0.74 0.5 0.74 4356 4000 500 0.86 0.64 2000 0 0 0 Job Performance Cluster efficiency Inter-job fairness Objectives: #1: Fairness DRF [NSDI 11] #3: Cluster Efficiency #2: Job Performance = is difficult! Tetris [SIGCOMM 14] Shortest Job First (SJF) Outperforms Preferred metric Underperforms Secondary metrics 4 TPC-DS workload on a 100-machine cluster

Scheduling in Data Analytics Clusters Makespan (seconds) Jain s Fairness Index Avg. JCT (seconds) 1 8000 1224 1000 1123 6210 6000 5478 769 0.74 0.5 0.74 4356 4000 500 0.86 0.64 2000 0 0 0 Job Performance Cluster efficiency Inter-job fairness Objectives: #1: Fairness DRF [NSDI 11] #3: Cluster Efficiency #2: Job Performance = is difficult! Tetris [SIGCOMM 14] Shortest Job First (SJF) Outperforms Preferred metric Underperforms Secondary metrics 4 TPC-DS workload on a 100-machine cluster

Scheduling in Data Analytics Clusters Makespan (seconds) Jain s Fairness Index Avg. JCT (seconds) 1 8000 1224 1000 1123 6210 6000 5478 769 0.74 0.5 0.74 4356 4000 500 0.86 0.64 2000 0 0 0 Job Performance Cluster efficiency Inter-job fairness Objectives: #1: Fairness DRF [NSDI 11] #3: Cluster Efficiency #2: Job Performance = is difficult! Tetris [SIGCOMM 14] Shortest Job First (SJF) Outperforms Preferred metric Underperforms Secondary metrics 4 TPC-DS workload on a 100-machine cluster

Scheduling in Data Analytics Clusters Makespan (seconds) Jain s Fairness Index Avg. JCT (seconds) 1 8000 1224 1000 1123 6210 6000 5478 769 0.74 0.5 0.74 4356 4000 500 0.86 0.64 2000 0 0 0 Job Performance Cluster efficiency Inter-job fairness Is it possible to ensure fairness and still be competitive with the best approaches for the secondary metrics (job performance and cluster efficiency)? 4

Key observation #1 S1 S0 Stage ID #Tasks x <Res. req> @Dur 3 x <.21> @1 2 x <.08> @.5 S0 S2 2 x <.29> @1 1 x <.1> @.1 Job 2 Job 1 Modern cluster schedulers focus on instantaneous fairness and force short-term optimizations Fair 1.0 Allocation 1 Greedy decisions Capacity 0.5 S0 S0 S1 S1 2 1 0 Time Traditional scheduling Traditional scheduling Assumption: Know tasks demands and durations 5

Key observation #1 S1 S0 Stage ID #Tasks x <Res. req> @Dur 3 x <.21> @1 2 x <.08> @.5 S0 S2 2 x <.29> @1 1 x <.1> @.1 Job 2 Job 1 Modern cluster schedulers focus on instantaneous fairness and force short-term optimizations 1.0 1 Capacity S0 S0 0.5 S0 S0 Avg. JCT: 2.05 S1 S1 S1 S2 2 1 0 Users care less for instantaneous fairness Time Traditional scheduling Traditional scheduling Assumption: Know tasks demands and durations 5

Key observation #1 S1 S0 Stage ID #Tasks x <Res. req> @Dur 3 x <.21> @1 2 x <.08> @.5 S0 S2 2 x <.29> @1 1 x <.1> @.1 Job 2 Job 1 Modern cluster schedulers focus on instantaneous fairness and force short-term optimizations 1.0 1 Capacity S0 S0 0.5 S0 S0 Avg. JCT: 2.05 S1 S0 S0 S1 S1 S2 2 1 0 Users care less for instantaneous fairness Time Traditional scheduling Altruistic scheduling Leftover:donated unnecessary resources Altruism:an action to contribute leftover resources 5

Key observation #1 S1 S0 Stage ID #Tasks x <Res. req> @Dur 3 x <.21> @1 2 x <.08> @.5 S0 S2 2 x <.29> @1 1 x <.1> @.1 Job 2 Job 1 Modern cluster schedulers focus on instantaneous fairness and force short-term optimizations 1.0 S0 1 Capacity S0 S0 0.5 Avg. JCT: Avg. JCT: 2.05 1.33 x better S1 S0 S0 S1 S1 S2 2 1 0 Users care less for instantaneous fairness Time Traditional scheduling Altruistic scheduling Leftover: donated unnecessary resources Altruism: an action to contribute leftover resources 5

Key observation #2 What increases opportunities? Complex DAG structures Longer DAGs Jobs in data analytics clusters have ample opportunities for altruism How much opportunities? 50% of the time at least 20% of the resources can be used as leftover 6

Altruistic multi-resource scheduling technique Carbyne #2. How much leftover should contribute? Inter-Job Scheduler Intra-Job Scheduler Leftover #1. How to maximize the amount of leftover resources? #3. How to redistribute the leftover ? 7

Inter-Job Scheduler Intra-Job Scheduler Leftover Maximize the amount of leftover resources Instantaneous fairness elongates job completion time the most and increases altruism opportunities Carbyne uses DRF for inter-job scheduling Any fair scheduler techniquecan be used Fair S0 1.0 Stage ID 2 x <.29> @1 Allocation #Tasks x <Res. req> @Dur Capacity Job 2 0.5 S1 S0 3 x <.21> @1 2 x <.08> @.5 S2 1 2 0 1 x <.1> @.1 Time 8 Job 1

Inter-Job Scheduler Intra-Job Scheduler Leftover How much leftover to contribute? Traditional scheduling to compute expected completion time Fair S0 1.0 Stage ID 2 x <.29> @1 Allocation #Tasks x <Res. req> @Dur Capacity S0 S0 Job 2 0.5 S1 S0 S0 S0 S1 3 x <.21> @1 2 x <.08> @.5 JCT Job 1: 2.1 S1 S1 S2 S2 JCT Job 2: 2.0 1 2 0 1 x <.1> @.1 Time 9 Job 1

Inter-Job Scheduler Intra-Job Scheduler Leftover How much leftover to contribute? Traditional scheduling to compute expected completion time Scheduling in the future from finish to current time Fair S0 1.0 Stage ID 2 x <.29> @1 Allocation #Tasks x <Res. req> @Dur Move into future Capacity S0 S0 Job 2 0.5 S1 S0 S0 S0 S1 3 x <.21> @1 2 x <.08> @.5 JCT Job 1: 2.1 S1 S1 S2 S2 JCT Job 2: 2.0 1 2 0 1 x <.1> @.1 Time 9 Job 1

Inter-Job Scheduler Intra-Job Scheduler Leftover How much leftover to contribute? Traditional scheduling to compute expected completion time Scheduling in the future from finish to current time Fair S0 1.0 Stage ID Move into future 2 x <.29> @1 Allocation #Tasks x <Res. req> @Dur Capacity S0 S0 Job 2 0.5 S1 S0 S0 S0 S1 3 x <.21> @1 2 x <.08> @.5 JCT Job 1: 2.1 S1 S1 S2 S2 JCT Job 2: 2.0 1 2 0 1 x <.1> @.1 Time 9 Job 1

Inter-Job Scheduler Intra-Job Scheduler Leftover How much leftover to contribute? Traditional scheduling to compute expected completion time Scheduling in the future from finish to current time Donate leftover resources through altruism Leftover Job 1: Fair S0 1.0 Stage ID 0.29 2 x <.29> @1 Allocation #Tasks x <Res. req> @Dur Capacity S0 S0 Job 2 0.5 Total: 0.29 S1 S0 S0 S0 S1 3 x <.21> @1 2 x <.08> @.5 JCT Job 1: 2.1 S1 S1 S2 S2 JCT Job 2: 2.0 1 2 0 1 x <.1> @.1 Time 9 Job 1

Inter-Job Scheduler Intra-Job Scheduler Leftover How much leftover to contribute? Traditional scheduling to compute expected completion time Scheduling in the future from finish to current time Donate leftover resources through altruism Leftover Job 1: Job 2: Total:0.50 Fair S0 1.0 Stage ID 0.29 0.21 0.29 2 x <.29> @1 Allocation #Tasks x <Res. req> @Dur Capacity S0 S0 Job 2 0.5 S1 S0 S0 S0 S1 3 x <.21> @1 2 x <.08> @.5 JCT Job 1: 2.1 S1 S1 S2 S2 JCT Job 2: 2.0 1 2 0 1 x <.1> @.1 Time 9 Job 1

Inter-Job Scheduler Intra-Job Scheduler Leftover How to redistribute the leftover resources? Goal 1: Improve average JCT Schedule jobs closest to completion time first Goal 2: Maximize efficiency Goals 1 and 2 can be interchanged Leftover Total:0.50 Fair S0 1.0 Stage ID 2 x <.29> @1 Allocation #Tasks x <Res. req> @Dur Capacity S0 S0 Job 2 0.5 S1 S0 S0 S0 S1 3 x <.21> @1 2 x <.08> @.5 JCT Job 1: 2.1 S1 S1 S2 S2 JCT Job 2: 2.0 1 2 0 1 x <.1> @.1 Time 10 Job 1

Inter-Job Scheduler Intra-Job Scheduler Leftover How to redistribute the leftover resources? Goal 1: Improve average JCT Schedule jobs closest to completion time first Goal 2: Maximize efficiency Goals 1 and 2 can be interchanged Leftover Total:0.50 Fair S0 1.0 Stage ID 2 x <.29> @1 Allocation #Tasks x <Res. req> @Dur 0.21 Capacity S0 Job 2 0.5 S1 S0 S0 S0 S0 S1 3 x <.21> @1 2 x <.08> @.5 JCT Job 1: 2.1 S1 S1 S2 S2 JCT Job 2: 2.0 JCT Job 2: 1.0 1 2 0 1 x <.1> @.1 Time 10 Job 1

Inter-Job Scheduler Intra-Job Scheduler Leftover How to redistribute the leftover resources? Goal 1: Improve average JCT Schedule jobs closest to completion time first Goal 2: Maximize efficiency Pack as many unscheduled tasks are possible Goals 1 and 2 can be interchanged Leftover Total:0.21 Fair S0 1.0 Stage ID 2 x <.29> @1 Allocation #Tasks x <Res. req> @Dur Capacity S0 Job 2 0.5 S1 S0 S0 S0 S0 S1 3 x <.21> @1 2 x <.08> @.5 JCT Job 1: 2.1 S1 S1 S2 S2 JCT Job 2: 1.0 1 2 0 1 x <.1> @.1 Time 10 Job 1

Inter-Job Scheduler Intra-Job Scheduler Leftover How to redistribute the leftover resources? Goal 1: Improve average JCT Schedule jobs closest to completion time first Goal 2: Maximize efficiency Pack as many unscheduled tasks are possible Goals 1 and 2 can be interchanged Leftover Total:0.21 Fair S0 1.0 Stage ID 2 x <.29> @1 Allocation #Tasks x <Res. req> @Dur 0 Capacity S0 Job 2 0.5 S0 S1 S0 S0 S0 S1 3 x <.21> @1 2 x <.08> @.5 JCT Job 1: 2.1 S1 S1 S2 S2 JCT Job 2: 2.0 1 2 0 1 x <.1> @.1 Time 10 Job 1

Putting it all together We saw Increase leftover via Inter-Job Scheduling Adopting best fair schedulers Compute leftover via Intra-Job Scheduling Leftover redistribution Improve JCT and cluster efficiency Other things in the paper Bounding altruism with P(Altruism) Resource estimation Data locality Straggler mitigation Task failures 11

Implemented in Yarn and Tez Evaluation 100 machine cluster deployment Replay Bing / Facebook traces and TPC-DS / TPC-H workloads 12

Fairness vs. Performance vs. Efficiency Makespan (seconds) Jain s Fairness Index Avg. JCT (seconds) 1 8000 1224 1000 1123 6210 6000 5478 769 0.74 0.5 0.74 4356 4000 500 0.86 0.64 2000 0 0 0 Job Performance Cluster efficiency Inter-job fairness 13

Fairness vs. Performance vs. Efficiency Makespan (seconds) Jain s Fairness Index Avg. JCT (seconds) 1 8000 1224 1000 1123 6210 6000 5478 814 4492 769 0.81 0.74 0.5 0.74 4356 4000 500 0.86 0.64 2000 0 0 0 Job Performance Cluster efficiency Inter-job fairness Gains from Altruism helps in long run Comparable performance with best approaches in each metric 13

Job Performance Snapshot of the execution of a TPC-DS query 20 DRF DRF Tasks from a DRF allocation # Running Tasks 15 Carbyne w/o leftover Only running tasks from being altruistic Carbyne Tasks from being altruistic and from leftover redistribution 10 5 0 301 501 701 901 1101 Time (Seconds) Gains from Altruism helps in long run 14

Job Performance Snapshot of the execution of a TPC-DS query 20 Carbyne w/o Leftover Carbyne DRF DRF Tasks from a DRF allocation # Running Tasks Greedy 15 Carbyne w/o leftover Only running tasks from being altruistic Carbyne Tasks from being altruistic and from leftover redistribution 10 Altruistic 5 0 301 501 701 901 1101 Time (Seconds) Gains from Altruism helps in long run Low resource contention DRF takes greedy decisions Approximates DRF allocation due to leftover 14

Job Performance Snapshot of the execution of a TPC-DS query 20 Carbyne w/o Leftover Carbyne DRF DRF Tasks from a DRF allocation # Running Tasks Leftover 15 Carbyne w/o leftover Only running tasks from being altruistic Carbyne Tasks from being altruistic and from leftover redistribution 10 5 0 301 501 701 901 1101 Time (Seconds) Gains from Altruism helps in long run High resource contention DRF progress is slowed down Carbyne progress faster due to receiving leftover 14

Job Performance Snapshot of the execution of a TPC-DS query 20 1 Carbyne w/o Leftover Carbyne DRF # Running Tasks Fraction of Jobs 0.8 15 0.6 DRF 10 0.4 Tetris 5 0.2 Carbyne 0 0 301 501 701 901 1101 0 250 500 750 1000 1250 Time (Seconds) Job Completion Time (Seconds) Performance Better jobs completion time 16% jobs slowed down 4% only by more than 0.8x Which jobs slow down? Longer jobs No bias towards shorter jobs Most jobs benefit from leftover 14

Impact of a Better Intra-Job Scheduler 1 Default intra-job scheduler Tetris Limited view of the job s DAG 0.8 Fraction of Jobs 0.6 0.4 Carbyne+Tetris 0.2 0 0.5 3 5.5 8 10.5 13 Factor of Improvement (w.r.t. DRF) 15

Impact of a Better Intra-Job Scheduler 1 Default intra-job scheduler Tetris Limited view of the job s DAG 0.8 Fraction of Jobs 0.6 0.4 Carbyne+Tetris 0.2 0 0.5 3 5.5 8 10.5 13 Better intra-job scheduler Graphene DAG-wide view Factor of Improvement (w.r.t. DRF) 15

Impact of a Better Intra-Job Scheduler 1 Default intra-job scheduler Tetris Limited view of the job s DAG 0.8 Fraction of Jobs 0.6 0.4 Carbyne+Tetris 0.2 Carbyne +GRAPHENE 0 0.5 3 5.5 8 10.5 13 Better intra-job scheduler Graphene DAG-wide view Extracts more leftover Further increase performance Factor of Improvement (w.r.t. DRF) 15

Increase Leftover via Inter-Job Scheduling Maximize Leftover for Individual Jobs Redistribution via Leftover Scheduling Carbyne Performance Efficiency Fairness Long-term altruistic view of Carbyne outperforms existing cluster schedulers which focus on instantaneous fairness Implemented inside YARN and Tez Performance comparable with best approaches in terms of fairness and job completion time and cluster efficiency 16

Backup slides

Impact of Altruism Levels 1.5 Factor of improvement 1 DRF 0.5 Tetris 0 0 0.25 0.5 0.75 1 Probability of Making Altruistic Choices Increasing levels of altruism increase performance Comparable when P(Altruism) == 0

Impact of Misestimations Factor of improvement 1.5 1 DRF 0.5 Tetris 0 -50 -25 0 25 50 Error in Resource Misestimations [%] Consistently better performance Comparable when resources are underestimated

Impact of Contention 2.5 Factor of improvement DRF Tetris 2.28 2.22 2 2.19 2.04 1.91 1.76 1.5 1.57 1.35 1 0.5 0 1 2 4 6 Multiple of Original Load More contention increases the need for carefully rearranging tasks Too much contention saturates the cluster; not much room for leftover allocations

Fairness vs. Performance vs. Efficiency - Online 1 Jain's Fairness Index 8000 Makespan (seconds) 1000 Avg. JCT (seconds) 1096 7415 986 6000 6654 0.84 774 5520 0.78 5124 0.5 4000 0.71 643 500 0.62 2000 0 0 0 Even in online case, Carbyne comes closely to the best metric in each metric

Data Locality vs. Straggler Mitigation vs. Task Failures Data Locality Altruistically giving up resources for data-local task may have adverse effects An altruistically delayed data-local task is likely to find data locality when it is eventually scheduled Straggler Mitigation Likely to prioritize speculative tasks during leftover scheduling because it selects jobs in the SRTF order Handling Task Failures Does not distinguish between new and restarted tasks in case of task failures, it has to recalculate the expected completion time for the job

Optimizing Cluster Scheduling for Data Analytics Efficiency

Download Presentation

Presentation Transcript

Related

More Related Content