Supporting Deadlines and Priorities in a MapReduce Cluster
Hadoop clusters run jobs with dual priorities: time-sensitive production jobs and lower-priority research jobs. The state of the art runs these on separate clusters; Natjam instead consolidates them into a single cluster, prioritizing production jobs while minimizing the impact on research jobs. It combines scaling down research jobs via preemption, job eviction policies, and task eviction policies to optimize completion times, and is built into the Hadoop YARN architecture.
NATJAM: Supporting Deadlines and Priorities in a MapReduce Cluster
Brian Cho (Samsung/Illinois), Muntasir Rahman, Tej Chajed, Indranil Gupta, Cristina Abad, Nathan Roberts (Yahoo! Inc.), Philbert Lin
University of Illinois (Urbana-Champaign)
Distributed Protocols Research Group (DPRG): http://dprg.cs.uiuc.edu
Hadoop Jobs Have Priorities
Dual priority case:
- Production jobs (high priority): time-sensitive; directly affect criticality or revenue
- Research jobs (low priority): e.g., long-term analysis
Example: an ad provider
- Ad click-through logs: count clicks, update ads
- Is there a better way to place ads?
- Slow counts mean showing old ads, and not getting paid $$$
- So: prioritize production jobs (click counting), and run machine-learning analysis over daily and historical logs as research jobs
State-of-the-Art: Separate Clusters
- Production cluster receives production jobs (high priority); research cluster receives research jobs (low priority)
- Traces reveal large periods of under-utilization in each cluster
- Long job completion times; human involvement in job management
Goal: a single consolidated cluster for all priorities and deadlines
- Prioritize production jobs, yet affect research jobs the least
Today's options:
- Wait for research tasks to finish (e.g., Capacity Scheduler): prolongs production jobs
- Kill research tasks (e.g., Fair Scheduler): can lead to repeated work, prolonging research jobs
Natjam's Techniques
1. Scale down research jobs by preempting some Reduce tasks
   - Fast, on-demand, automated checkpointing of task state
   - Later, Reduces can resume where they left off
   - Focus on Reduces: Reduce tasks take longer, so there is more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook])
2. Job eviction policies
3. Task eviction policies
Natjam Built into the Hadoop YARN Architecture
- Preemptor (inside the Resource Manager's Capacity Scheduler): chooses the victim job; reclaims queue resources
- Releaser (inside each Application Master): chooses the victim task
- Local Suspender (at each task): saves the state of the victim task
[Architecture figure: the Capacity Scheduler calls preempt(); the Preemptor asks an Application Master to release a number of containers; the Releaser picks a task via release(); that task's Local Suspender saves its state, leaving an empty container; the task is later resumed from the saved state]
Suspending and Resuming Tasks
- Existing intermediate data is reused: Reduce inputs are stored at the local host; Reduce outputs are stored on HDFS
- Suspended task state is saved locally, so resume can avoid network overhead
- Checkpoint state saved: key counter, Reduce input path, hostname, list of suspended task attempt IDs
[Figure: Task Attempt 1 is suspended, freeing its container and saving its key counter under tmp/task_att_1; Task Attempt 2 resumes, skipping already-processed inputs and continuing to write to outdir/ on HDFS]
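The suspend/resume idea above can be sketched in a few lines. This is a minimal illustration, not Natjam's actual Hadoop code: on suspend only a tiny key counter is checkpointed, and a resumed attempt skips keys the suspended attempt already reduced (all class and file names here are hypothetical).

```python
import json
import os
import tempfile

class SuspendableReduce:
    """Sketch of Natjam-style Reduce checkpointing: on suspend, only a
    small key counter is saved; on resume, already-reduced keys are skipped."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.keys_done = 0
        # Resume: load the key counter saved by a previously suspended attempt.
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as f:
                self.keys_done = json.load(f)["key_counter"]

    def run(self, sorted_inputs, reduce_fn, budget=None):
        """Process (key, values) pairs; suspend early if preempted (budget hit)."""
        out = []
        for i, (key, values) in enumerate(sorted_inputs):
            if i < self.keys_done:
                continue  # skip keys already reduced before suspension
            if budget is not None and i >= budget:
                self.suspend()  # preempted: checkpoint state, free the container
                return out, False
            out.append((key, reduce_fn(values)))
            self.keys_done = i + 1
        return out, True

    def suspend(self):
        # Save only the tiny checkpoint: how many keys were fully reduced.
        with open(self.checkpoint_path, "w") as f:
            json.dump({"key_counter": self.keys_done}, f)
```

A first attempt preempted after two keys leaves a checkpoint; a second attempt constructed with the same path resumes at the third key instead of redoing work, which is the repeated-work savings over killing.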
Two-Level Eviction Policies
On a container request in a full cluster:
1. Job eviction, at the Preemptor (Resource Manager)
2. Task eviction, at the Releaser (Application Master)
[Figure: the Capacity Scheduler's preempt() call tells an Application Master how many containers to release; that job's Releaser then picks which tasks to suspend]
Job Eviction Policies
Based on the total amount of resources (e.g., containers) held by the victim job, known at the Resource Manager:
1. Least Resources (LR): large research jobs are unaffected, but small research jobs can starve (e.g., under repeated production arrivals)
2. Most Resources (MR): small research jobs are unaffected, but the largest research job can starve
3. Probabilistically-weighted on Resources (PR): weighs jobs by number of containers, treating all tasks the same across jobs; affects multiple research jobs
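The three policies reduce to simple selections over the per-job container counts the Resource Manager already tracks. A minimal sketch, assuming `jobs` maps job id to containers held (the function and argument names are hypothetical, not Natjam's API):

```python
import random

def pick_victim_job(jobs, policy, rng=random):
    """Sketch of Natjam's job eviction policies.
    `jobs` maps job id -> number of containers currently held."""
    if policy == "LR":   # Least Resources: evict the smallest job
        return min(jobs, key=jobs.get)
    if policy == "MR":   # Most Resources: evict the largest job
        return max(jobs, key=jobs.get)
    if policy == "PR":   # Probabilistically weighted by containers held,
        ids = list(jobs)  # so every container is equally likely to be hit
        weights = [jobs[j] for j in ids]
        return rng.choices(ids, weights=weights, k=1)[0]
    raise ValueError("unknown policy: " + policy)
```

Weighting PR by container count is what makes it treat all tasks the same across jobs: a job holding twice the containers is twice as likely to be the victim.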
Task Eviction Policies
Based on the time remaining per task, known at the Application Master:
1. Shortest Remaining Time (SRT): leaves the tail of the research job alone, but holds on to containers that would have been released soon anyway
2. Longest Remaining Time (LRT): may lengthen the tail, but releases more containers earlier
However, SRT is provably optimal under some conditions. This is counter-intuitive: SRT amounts to longest-job-first scheduling.
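Given the Application Master's per-task time-remaining estimates, the two policies are again simple selections. A sketch under that assumption (names hypothetical):

```python
def pick_victim_task(remaining, policy):
    """Sketch of Natjam's task eviction policies.
    `remaining` maps task id -> estimated seconds of work left."""
    if policy == "SRT":  # suspend the task with the Shortest Remaining Time;
        return min(remaining, key=remaining.get)  # long-tail tasks keep running
    if policy == "LRT":  # suspend the task with the Longest Remaining Time;
        return max(remaining, key=remaining.get)  # frees big containers early
    raise ValueError("unknown policy: " + policy)
```

Note how SRT keeps the longest-remaining tasks running, which is why it behaves like longest-job-first scheduling at the task level.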
Eviction Policies in Practice
Task eviction:
- SRT is 20% faster than LRT for research jobs; the production job is similar under SRT and LRT
- Theorem: when research tasks resume simultaneously, SRT results in the shortest job completion time
Job eviction:
- MR is best; PR is very close behind; LR is 14%-23% worse than MR
MR + SRT is the best combination.
Natjam-R: Multiple Priorities
A special case of priorities: jobs with real-time deadlines (best-effort only; no admission control).
- The Resource Manager keeps a single queue of jobs sorted by increasing priority (derived from deadline)
- It periodically scans the queue, evicting a later job to give resources to an earlier, waiting job
Job eviction policies:
1. Maximum Deadline First (MDF): priority = deadline. Prefers short-deadline jobs, but may miss deadlines, e.g., by scheduling a large job instead of a small job with a slightly larger deadline
2. Maximum Laxity First (MLF): priority = laxity = deadline minus the job's projected completion time. Pays attention to the job's resource requirements
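The difference between the two priority functions can be made concrete. A sketch, where a smaller value means more urgent and the field names (`deadline`, `remaining_work`) are hypothetical stand-ins for what the Resource Manager would track:

```python
def priority(job, now, policy):
    """Sketch of Natjam-R job priorities; the RM sorts its queue on this.
    Smaller return value = more urgent."""
    if policy == "MDF":  # Maximum Deadline First: urgency is the deadline itself
        return job["deadline"]
    if policy == "MLF":  # Maximum Laxity First: deadline minus projected finish,
        projected_completion = now + job["remaining_work"]  # so resource needs count
        return job["deadline"] - projected_completion
    raise ValueError("unknown policy: " + policy)
```

With a large job due at t=300 needing 250 s of work and a small job due at t=280 needing 20 s, MDF ranks the small job first (earlier deadline), while MLF ranks the large job first (laxity 50 s vs. 260 s), illustrating how the two can disagree.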
Job Deadlines: MDF vs. MLF in Practice
[Figure: Map and Reduce progress (%) of Jobs 1-3 over time (0-300 s) on an 8-node cluster; Yahoo! trace experiments are in the paper]
- MDF prefers short-deadline jobs
- MLF moves the jobs in lockstep and misses all deadlines
Natjam vs. Alternatives
Microbenchmark on a 7-node cluster: at t=0 s a Research-XL job is submitted (100% of the cluster); at t=50 s a Production-S job is submitted (25% of the cluster).
[Bar chart: average execution time (seconds) of the two jobs under Ideal (empty cluster), Capacity Scheduler with a hard cap, Capacity Scheduler with a soft cap, Killing, and Natjam. The alternatives come in 50%, 20%, and 90% worse than ideal; Natjam is only 2-7% worse than ideal, 15% better than Killing, and 40% better than the soft cap]
Large Experiments
250 nodes at Yahoo!, driven by Yahoo! traces.
Natjam vs. waiting for research tasks (Hadoop Capacity Scheduler, soft cap):
- Production jobs: 53% benefit; 97% delayed by less than 5 s
- Research jobs: 63% benefit; very few outliers (low starvation)
Natjam vs. killing research tasks:
- Production jobs: largely unaffected
- Research jobs: 38% finish more than 100 s faster; the fastest 5% finish more than 750 s faster; biggest improvement 1880 s; negligible starvation
Related Work
Single-cluster job scheduling has focused on:
- Locality of Map tasks [Quincy, Delay Scheduling]
- Speculative execution [LATE Scheduler]
- Average fairness between queues [Capacity Scheduler, Fair Scheduler]
Recent work:
- Elastic queues [Amoeba]: uses Sailfish, which needs a special intermediate file system and does not work with Hadoop
- MAPREDUCE-5269 JIRA: preemption in Hadoop
Takeaways
- Natjam supports dual priority and arbitrary priorities (derived from deadlines)
- SRT (shortest remaining time) is the best policy for task eviction
- MR (most resources) is the best policy for job eviction
- MDF (maximum deadline first) is the best policy for job eviction in Natjam-R
- 2-7% overhead for the dual-priority case
Please see our poster and demo video later today!
Backup slides
Contributions
Our system, Natjam, allows us to:
- Maintain one cluster, with a production queue and a research queue
- Prioritize production jobs and complete them quickly
- While affecting research jobs the least
(Later: extend to multiple priorities.)
Hadoop 0.23's Capacity Scheduler Limitation: Research Jobs Cannot Scale Down
Hadoop capacity is shared using queues, each with a guaranteed capacity (G) and a maximum capacity (M).
Example: Production (P) queue G=80%/M=80%; Research (R) queue G=20%/M=40%.
1. Production job submitted first: P takes 80%; over time, R can only grow to 40% (under-utilization)
2. Research job submitted first: R takes 40%; P cannot grow beyond 60% (under-utilization)
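The under-utilization in both cases falls out of one rule: a queue can grow only to the smaller of its maximum cap and the cluster's free capacity. A sketch of that rule with the slide's numbers (function and parameter names are illustrative, not the Capacity Scheduler's API):

```python
def max_growable(queue_max, other_used, queue_used, cluster_capacity=100):
    """Sketch of the maximum-capacity (M) limitation: a queue can grow to
    the smaller of its max cap and the cluster's free capacity, even when
    the rest of the cluster sits idle."""
    free = cluster_capacity - other_used - queue_used
    return queue_used + min(queue_max - queue_used, free)
```

With M=40% for research and M=80% for production: a research job running alone is still capped at 40% (60% of the cluster idles), and a production job arriving while research holds 40% can only reach 60% despite its 80% guarantee, matching the two cases above.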
Natjam Scheduler
Does not require a maximum capacity. Scales down research jobs by:
- Preempting Reduce tasks
- Fast, on-demand, automated checkpointing of task state
- Resumption where the task left off
Focus on Reduces: Reduce tasks take longer, so there is more work to lose (median Map 19 seconds vs. Reduce 231 seconds [Facebook]).
Prioritizing production jobs:
1. P/R guaranteed 80%/20%: R takes 100% of the cluster; over time, P takes 80%
2. P/R guaranteed 100%/0%: R takes 100% of the cluster; over time, P takes 100%
Yahoo! Hadoop Traces: CDF of Differences in Job Completion Time (negative is good)
[Figure: CDFs of per-job completion-time differences (seconds), Natjam minus Killing and Natjam minus Soft Cap, for production and research jobs, on a 7-node cluster and on a 250-node Yahoo! cluster. On the 250-node cluster, only two jobs starved (by 260 s and 390 s), and the largest benefit was 1880 s]