Intermediate Data in Cloud Computations


This presentation examines the role of intermediate data in dataflow programming frameworks in cloud environments: why intermediate data matters, and the outline of a potential solution. It covers dataflow programming frameworks such as MapReduce (Hadoop), Pig, and Hive for massive-scale data processing.

  • Cloud Computing
  • Dataflow Programming
  • Intermediate Data
  • Cloud Frameworks
  • Data Processing




Presentation Transcript


  1. On Availability of Intermediate Data in Cloud Computations Steven Y. Ko, Imranul Hoque, Brian Cho, and Indranil Gupta Distributed Protocols Research Group (DPRG) University of Illinois at Urbana-Champaign

  2.–6. Our Position (build-up slides) Intermediate data as a first-class citizen for dataflow programming frameworks in clouds. Outline: dataflow programming frameworks; the importance of intermediate data; outline of a solution. This talk builds up the case and emphasizes the need, not the solution.

  7. Dataflow Programming Frameworks Runtime systems that execute dataflow programs: MapReduce (Hadoop), Pig, Hive, etc. Gaining popularity for massive-scale data processing, with distributed and parallel execution on clusters. A dataflow program consists of a multi-stage computation and communication patterns between stages.

  8. Example 1: MapReduce A two-stage computation with all-to-all communication. Introduced by Google; open-sourced by Yahoo! as Hadoop. Two functions, Map and Reduce, are supplied by the programmer and executed massively in parallel. Stage 1: Map → Shuffle (all-to-all) → Stage 2: Reduce.
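The two-stage model on this slide can be sketched as a minimal, single-process word count in Python. This is an illustration of the programming model only, not Hadoop's implementation: in a real framework the Map and Reduce calls run in parallel across a cluster, and the shuffle is an all-to-all network transfer of the intermediate data this talk is about.

```python
from collections import defaultdict

def map_fn(line):
    # Stage 1 (Map): emit (word, 1) pairs. These pairs are the
    # intermediate data, normally written to local disk.
    for word in line.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle (all-to-all): group intermediate pairs by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    # Stage 2 (Reduce): aggregate each group into a final value.
    return key, sum(values)

def run_job(lines):
    intermediate = [pair for line in lines for pair in map_fn(line)]
    return dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
```

Note that `reduce_fn` cannot start until the shuffle has grouped all Map output, which is exactly the computational barrier discussed on later slides.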

  9. Example 2: Pig and Hive Pig from Yahoo! and Hive from Facebook. Built atop MapReduce: declarative, SQL-style languages with automatic generation and execution of multiple MapReduce jobs.

  10. Example 2: Pig and Hive (cont.) Multi-stage computation with either all-to-all or 1-to-1 communication: Stage 1: Map → Shuffle (all-to-all) → Stage 2: Reduce → 1-to-1 comm. → Stage 3: Map → Stage 4: Reduce.
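The multi-stage pipeline above can be sketched as two chained MapReduce jobs in Python, where the second job consumes the first job's output (intermediate data) as its input. This is a toy, single-process illustration of how a declarative query might compile into a job chain, not Pig's or Hive's actual compiler output; the word-count-then-sort plan is an assumption chosen for the example.

```python
def mapreduce(records, map_fn, reduce_fn):
    # One MapReduce job: map, shuffle (group by key), reduce.
    groups = {}
    for rec in records:
        for k, v in map_fn(rec):
            groups.setdefault(k, []).append(v)
    return [reduce_fn(k, vs) for k, vs in sorted(groups.items())]

def count_job(lines):
    # Job 1: count words (Stage 1 Map + Stage 2 Reduce).
    return mapreduce(lines,
                     lambda line: [(w, 1) for w in line.split()],
                     lambda k, vs: (k, sum(vs)))

def sort_job(counts):
    # Job 2: re-key by descending count (Stage 3 Map + Stage 4 Reduce).
    # Its input is the intermediate output of job 1, passed 1-to-1.
    return mapreduce(counts,
                     lambda kv: [(-kv[1], kv[0])],
                     lambda k, ws: (sorted(ws), -k))
```

If the output of `count_job` is lost before `sort_job` finishes, the whole chain must restart from the original input, which is the cascaded re-execution problem raised later in the talk.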

  11. Usage

  12. Usage Google (MapReduce): indexing is a chain of 24 MapReduce jobs; ~200K jobs processing 50 PB/month (in 2006). Yahoo! (Hadoop + Pig): WebMap is a chain of 100 MapReduce jobs. Facebook (Hadoop + Hive): ~300 TB total, adding 2 TB/day (in 2008); 3K jobs processing 55 TB/day. Amazon: Elastic MapReduce service (pay-as-you-go). Academic clouds: Google-IBM Cluster at UW (Hadoop service); CCT at UIUC (Hadoop & Pig service).

  13. One Common Characteristic Intermediate data: the data between stages. It has similarities to traditional intermediate data (e.g., .o files): critical to produce the final output; short-lived; written once and read once; used immediately.

  14. One Common Characteristic Intermediate data is written locally and read remotely, and can be very large in volume (depending on the workload). It forms a computational barrier: Stage 1: Map → (computational barrier) → Stage 2: Reduce.

  15. Computational Barrier + Failures Availability becomes critical: if intermediate data is lost before or during the execution of a task, the task can't proceed. Stage 1: Map → Stage 2: Reduce.

  16. Current Solution Store locally and re-generate when lost: re-run the affected Map and Reduce tasks, with no support from a storage system. The assumption is that re-generation is cheap and easy. Stage 1: Map → Stage 2: Reduce.

  17. Hadoop Experiment Emulab setting (for all plots in this talk): 20 machines sorting 36 GB; 4 LANs and a core switch (all 100 Mbps). Normal execution: Map → Shuffle → Reduce.

  18. Hadoop Experiment With 1 failure after Map, the Map-Shuffle-Reduce sequence is re-executed, causing a ~33% increase in completion time.

  19. Re-Generation for Multi-Stage Cascaded re-execution is expensive: Stage 1: Map → Stage 2: Reduce → Stage 3: Map → Stage 4: Reduce.
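Why cascaded re-execution is expensive can be made concrete with a back-of-the-envelope sketch: when a stage's intermediate output is lost and nothing is replicated, recovery must walk back to the nearest durable output and re-run everything after it. The function, its stage numbering, and the uniform stage times in the test are illustrative assumptions, not a model from the paper.

```python
def recovery_cost(stage_times, lost_stage, replicated):
    """Extra time to recover when the output of stage `lost_stage`
    (1-indexed) is lost. `replicated` is the set of stages whose
    output is stored durably; stage 0 (the persisted job input)
    is always durable."""
    # Walk back to the last stage whose output survives.
    first_rerun = lost_stage
    while first_rerun - 1 > 0 and (first_rerun - 1) not in replicated:
        first_rerun -= 1
    # Re-run stages first_rerun .. lost_stage (cascaded re-execution).
    return sum(stage_times[s - 1] for s in range(first_rerun, lost_stage + 1))
```

With four 10-minute stages and no replication, losing stage 4's output costs a full 40-minute re-run; replicating stage 3's output cuts the recovery to a single 10-minute stage.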

  20. Importance of Intermediate Data Why? It is critical for execution (the barrier), and very costly when lost. Current systems handle it themselves: they re-generate lost data, which can lead to expensive cascaded re-execution, with no support from the storage layer. We believe the storage layer, not the dataflow frameworks, is the right abstraction.

  21. Our Position Intermediate data as a first-class citizen for dataflow programming frameworks in clouds. Outline of a solution: why is storage the right abstraction? Challenges; research directions.

  22. Why is Storage the Right Abstraction? Replication stops cascaded re-execution: Stage 1: Map → Stage 2: Reduce → Stage 3: Map → Stage 4: Reduce.

  23. So, Are We Done? No! The challenge is minimal interference: the network is heavily utilized during Shuffle, and replication requires network transmission too, so minimizing interference is critical for the overall job completion time. Any existing approaches? HDFS (Hadoop's default file system) causes much interference (next slide); background replication with TCP-Nice is not designed for network utilization and control (no further discussion; please refer to our paper).

  24. Modified HDFS Interference Unmodified HDFS incurs much overhead with synchronous replication, so we modified it for asynchronous replication and measured an increasing level of interference. Four levels: Hadoop (original, no replication, no interference); Read (disk read, no network transfer, no actual replication); Read-Send (disk read and network send, no actual replication); Rep. (full replication).

  25. Modified HDFS Interference Even with asynchronous replication, network utilization makes the difference: both Map and Shuffle are affected, since some Maps need to read remotely.

  26. Our Position Intermediate data as a first-class citizen for dataflow programming frameworks in clouds. Outline of a new storage system design: why is storage the right abstraction? Challenges; research directions.

  27. Research Directions Two requirements: intermediate data availability (to stop cascaded re-execution) and interference minimization (focusing on network interference). Solution: replication with minimal interference.

  28. Research Directions Replication using spare bandwidth: there is not much network activity during Map and Reduce computation, so tight bandwidth monitoring and control can exploit it. Deadline-based replication: replicate every N stages. Replication based on a cost model: replicate only when re-execution is more expensive.
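The cost-model direction can be sketched as a simple decision rule: replicate a stage's intermediate data only when the expected cost of re-executing the lost stages exceeds the cost of replicating now over spare bandwidth. The function and its parameters (failure probability, spare bandwidth, re-run time) are illustrative assumptions, not a model from the paper.

```python
def should_replicate(data_gb, spare_bw_gbps, rerun_seconds, p_loss):
    """Decide whether to replicate intermediate data of size
    `data_gb`, given `spare_bw_gbps` of unused network bandwidth,
    an estimated `rerun_seconds` of cascaded re-execution if the
    data is lost, and a loss probability `p_loss`."""
    # Time to replicate now, using only spare network bandwidth.
    replication_cost = data_gb * 8 / spare_bw_gbps   # seconds
    # Expected time lost to cascaded re-execution.
    expected_rerun_cost = p_loss * rerun_seconds     # seconds
    return expected_rerun_cost > replication_cost
```

For example, 10 GB over 1 Gbps of spare bandwidth costs 80 seconds to replicate; with a 10% loss probability, replication pays off once the re-execution it avoids would take longer than about 800 seconds.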

  29. Summary Our position: intermediate data as a first-class citizen for dataflow programming frameworks in clouds. The problem: cascaded re-execution. Requirements: intermediate data availability and interference minimization. Further research is needed.

  30. BACKUP

  31. Default HDFS Interference Replication of Map and Reduce outputs.

  32. Default HDFS Interference Replication policy: local, then remote-rack; synchronous replication.
