MapReduce: Simplified Data Processing on Large Clusters, by Youhui Bai

Learn about MapReduce, a simplified data-processing technique for large clusters. The content covers motivation, introduction, implementation, evaluation, and experience with handling big data efficiently.

  • Data Processing
  • MapReduce
  • Big Data
  • Cluster Computing
  • Youhui Bai


Presentation Transcript


  1. MapReduce: Simplified Data Processing on Large Clusters. Youhui Bai

  2. Outline: Motivation, Introduction, Implementation, Evaluation, Experience

  3. Motivation. Many computations derive useful data from huge raw inputs, such as the graph structure of web documents or the set of most frequent queries.

  4. How to get useful data? Parallel programming (MPI, OpenMP) or distributed processing with many machines working together; either way, you must parallelize the computation, distribute the data, and handle failures, which makes the programming complex.

  5. Motivation. What we want: a library with a simple programming API that takes care of parallelizing the computation, distributing the data, fault tolerance, and load balancing.

  6. Outline: Motivation, Introduction, Implementation, Evaluation, Experience

  7. Introduction: an example, word count. Very big data is split into pieces and each split is counted independently; for an input such as A A C D E E the splits produce the partial counts {A:1, A:1}, {C:1, D:1}, and {E:1, E:1}. The counts for each word are then grouped ({A:[1,1]}, {C:1, D:1}, {E:[1,1]}) and accumulated, and the merged result is A:2, C:1, D:1, E:2.
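
The word-count picture above maps directly onto two small functions. Below is a minimal pure-Python sketch of the same example (a toy driver, not the actual MapReduce library): the map function emits a (word, 1) pair per word in a split, and the reduce function sums the list of counts collected for each word.

    from collections import defaultdict

    def map_fn(split):
        # Emit an intermediate (word, 1) pair for every word in one split.
        for word in split.split():
            yield word, 1

    def reduce_fn(word, counts):
        # Accumulate all the counts recorded for one word.
        return word, sum(counts)

    splits = ["A A", "C D", "E E"]            # "very big data" cut into splits
    intermediate = defaultdict(list)
    for split in splits:                      # map phase, one task per split
        for word, count in map_fn(split):
            intermediate[word].append(count)  # group: {"A": [1, 1], ...}
    result = dict(reduce_fn(w, c) for w, c in intermediate.items())
    print(result)                             # {'A': 2, 'C': 1, 'D': 1, 'E': 2}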

  8. Map + Reduce. Map accepts input key/value pairs and emits intermediate key/value pairs; Reduce accepts intermediate key/value pairs and emits output key/value pairs. Very big data flows through the map tasks, a partitioning function routes the intermediate pairs to the reduce tasks, and the result comes out the other end.
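
The partitioning function decides which of the R reduce tasks receives each intermediate key. A minimal sketch of the default hash-based scheme described in the paper (the function name and the value of R here are illustrative):

    R = 5  # number of reduce tasks (an arbitrary value for the sketch)

    def default_partition(key, R):
        # Default scheme: hash(key) mod R picks the reduce task for this key.
        return hash(key) % R

    # Every intermediate pair with the same key lands in the same partition,
    # so a single reduce task sees all the values emitted for that key.
    print(default_partition("A", R) == default_partition("A", R))  # True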

  9. Outline: Motivation, Introduction, Implementation, Evaluation, Experience

  10. Execution overview. A typical large job uses M = 200,000 map tasks and R = 5,000 reduce tasks on 2,000 machines.

  11. Master data structures. For each map and reduce task the master stores its state (idle, in-progress, or completed) and, for completed map tasks, the location of the intermediate output so that reduce tasks know where to read.
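
A sketch of this bookkeeping, assuming plain Python dataclasses (the field names are illustrative, not taken from the paper's code):

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List, Optional

    class State(Enum):
        IDLE = "idle"
        IN_PROGRESS = "in-progress"
        COMPLETED = "completed"

    @dataclass
    class Task:
        kind: str                              # "map" or "reduce"
        state: State = State.IDLE
        worker: Optional[str] = None           # machine currently running it
        # For completed map tasks: where the intermediate files live, so
        # that reduce tasks know which machines to read from.
        output_locations: List[str] = field(default_factory=list)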

  12. Fault tolerance. When a worker fails, its completed map tasks, its in-progress map tasks, and its in-progress reduce tasks are all reset to idle by the master and rescheduled on other workers.
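
Continuing the hypothetical Task/State sketch above, the master's reaction to a lost worker might look like the following: map tasks are redone even if completed, because their intermediate output lived on the failed machine's local disk, while completed reduce tasks are kept, because their output already sits in the global file system.

    def handle_worker_failure(tasks, dead_worker):
        # Reschedule every task that was tied to the failed machine.
        for task in tasks:
            if task.worker != dead_worker:
                continue
            if task.kind == "map":
                # Map output lives on the dead machine's local disk, so even
                # completed map tasks must be redone from scratch.
                task.state = State.IDLE
                task.worker = None
                task.output_locations.clear()
            elif task.state == State.IN_PROGRESS:
                # Completed reduce output is already in the global file
                # system, so only in-progress reduce tasks go back to idle.
                task.state = State.IDLE
                task.worker = None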

  13. Locality. Map tasks read their input splits from GFS, but the master tries to schedule each map task on (or near) a machine that already holds a replica of that split; reduce tasks then read the intermediate data from the map workers.

  14. Backup tasks. A straggler (in the figure, one worker with RTT = 10 while the others have RTT = 1) can delay the whole operation. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks; a task is marked as completed when either the primary or the backup execution completes.
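
A rough sketch of this straggler mitigation, again reusing the hypothetical Task/State types above and assuming "close to completion" simply means that few in-progress tasks remain (the threshold and callback are made up for illustration):

    def schedule_backups(tasks, launch_backup, threshold=3):
        # When only a handful of in-progress tasks remain, run a second copy
        # of each on an idle worker; whichever copy finishes first marks the
        # task completed, so one straggler cannot delay the whole operation.
        remaining = [t for t in tasks if t.state == State.IN_PROGRESS]
        if len(remaining) <= threshold:
            for task in remaining:
                launch_backup(task)  # hypothetical callback from the master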

  15. Refinements: user-specified partitioning functions; ordering guarantees; user-specified combiner functions; custom input and output types; a mode for execution on a single machine.
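
The combiner refinement deserves a small sketch: it performs the same accumulation as the reduce function, but on the map worker, so repeated intermediate pairs are collapsed before they cross the network. A minimal illustration in the spirit of the word-count example above:

    from collections import defaultdict

    def combine(pairs):
        # Run on the map worker: pre-aggregate intermediate pairs locally so
        # that, e.g., a thousand ("the", 1) pairs cross the network as a
        # single ("the", 1000) pair.
        local = defaultdict(int)
        for word, count in pairs:
            local[word] += count
        return list(local.items())

    print(combine([("A", 1), ("A", 1), ("C", 1)]))  # [('A', 2), ('C', 1)]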

  16. Outline: Motivation, Introduction, Implementation, Evaluation, Experience

  17. Evaluation: experimental setup. Roughly 1,800 machines, each with two 2 GHz Intel Xeon processors, 4 GB of memory, two 160 GB IDE disks, and a gigabit Ethernet link, connected by a two-level tree-shaped switched network.

  18. Sort. Sorting 1 TB of data with M = 15,000, R = 4,000, and 1,800 machines completes in 891 seconds, compared with 1,057 seconds for the best reported TeraSort benchmark result.

  19. Outline: Motivation, Introduction, Implementation, Evaluation, Experience

  20. Experience: MapReduce in distributed WAL. Write-ahead logging is used by most transactional systems (databases, file systems) for reliability. Forward processing: every update goes to the log first, then to its real place. Recovery: replay the winners, roll back the losers.

  21. Write-ahead logging. Example: to update a balance to Bal = 500, the system 1. logs the change, and only then 2. really changes the data in place; recovery replays winners and rolls back losers.
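
A toy illustration of the log-first rule on the Bal = 500 example, using a made-up in-memory log and data store (nothing here is from the talk's actual system):

    log = []              # the write-ahead log (durable storage in reality)
    data = {"Bal": 100}   # the "real place" being updated

    def update(key, value, txn_id):
        # 1. Log it: record the intended change before touching the data.
        log.append({"txn": txn_id, "key": key,
                    "old": data.get(key), "new": value})
        # 2. Really change it: apply the update in place.
        data[key] = value

    update("Bal", 500, txn_id=1)
    print(log[-1])        # the log entry is written before the new value
    print(data)           # {'Bal': 500}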

  22. Recovery, transaction-level redo: map phase. The map tasks scan the log fragments Log0 (0 1 0 4 6 3), Log1 (4 2 1 3 6), and Log2 (5 2 5).

  23. Recovery, transaction-level redo: reduce phase. The reduce tasks gather the records scanned from Log0 (0 1 0 4 6), Log1 (3 4 2 1 3 6), and Log2 (5 2 5).
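
One hedged reading of slides 22-23 is that the recovery scan itself is phrased as a MapReduce job: map tasks read log fragments and key each record by transaction id, and reduce tasks gather one transaction's records and decide whether to redo a winner or undo a loser. The record format and the commit test below are assumptions for illustration only.

    from collections import defaultdict

    def map_log(log_records):
        # Each map task scans one log fragment and keys every record by its
        # transaction id.
        for rec in log_records:
            yield rec["txn"], rec

    def reduce_txn(txn_id, records):
        # All records of one transaction meet in a single reduce task, which
        # decides: redo a winner (committed) or undo a loser (uncommitted).
        committed = any(r["type"] == "commit" for r in records)
        return txn_id, "redo" if committed else "undo"

    logs = [                                         # stand-ins for Log0, Log1
        [{"txn": 0, "type": "update"}, {"txn": 0, "type": "commit"}],
        [{"txn": 1, "type": "update"}],
    ]
    grouped = defaultdict(list)
    for fragment in logs:                            # map phase over fragments
        for txn, rec in map_log(fragment):
            grouped[txn].append(rec)
    print([reduce_txn(t, rs) for t, rs in grouped.items()])
    # [(0, 'redo'), (1, 'undo')]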
