Hadoop MapReduce and GFS Architecture

Explore how Hadoop MapReduce leverages distributed analytics engines for handling big data, along with insights into the Google File System (GFS) architecture for data-intensive applications. Learn about Map and Reduce phases, HDFS storage layer, chunk distribution policies, and the scalability of these systems in processing massive datasets spanning Gigabytes to Petabytes.

  • Hadoop MapReduce
  • GFS Architecture
  • Big Data Analytics
  • Distributed Systems
  • Data Storage

Presentation Transcript


  1. Database Applications (15-415): Hadoop. Lecture 26, April 19, 2016. Mohammad Hammoud

  2. Hadoop MapReduce
     • MapReduce is one of the most successful realizations of large-scale data-parallel distributed analytics engines.
     • Hadoop is an open source implementation of MapReduce.
     • Hadoop MapReduce uses the Hadoop Distributed File System (HDFS) as a distributed storage layer.
     • HDFS is an open source implementation of GFS.

  3. GFS Data Distribution Policy
     • The Google File System (GFS) is a scalable distributed file system (DFS) for data-intensive applications.
     • GFS divides large files into multiple pieces called chunks or blocks (64 MB by default) and stores them on different data servers. This design is referred to as a block-based design.
     • Each GFS chunk has a unique 64-bit identifier and is stored as a file in the lower-layer local file system on the data server.
     • GFS distributes chunks across the cluster's data servers using a random distribution policy.
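The block-based design and random distribution policy on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the server names, and the use of a random 64-bit number as the chunk identifier are hypothetical, and the slides do not say how GFS actually generates identifiers or picks servers.

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the default chunk size named on the slide


def split_into_chunks(file_size, chunk_size=CHUNK_SIZE):
    """Return one (index, start_offset, length) tuple per chunk of the file."""
    chunks = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(chunk_size, file_size - offset)  # last chunk may be short
        chunks.append((index, offset, length))
        offset += length
        index += 1
    return chunks


def place_chunks(chunks, servers, rng):
    """Give each chunk a 64-bit id and a server picked uniformly at random.

    Assumption: a random 64-bit number stands in for the unique identifier.
    """
    placement = {}
    for index, _offset, _length in chunks:
        chunk_id = rng.getrandbits(64)
        placement[index] = (chunk_id, rng.choice(servers))
    return placement


rng = random.Random(0)  # fixed seed only so that reruns give the same placement
servers = ["server0", "server1", "server2", "server3"]  # hypothetical names
chunks = split_into_chunks(7 * CHUNK_SIZE)  # a 448 MB file yields Blk 0 to Blk 6
placement = place_chunks(chunks, servers, rng)

for index, offset, _length in chunks:
    chunk_id, server = placement[index]
    print(f"Blk {index} at {offset // (1024 * 1024)}M, id {chunk_id:016x}, on {server}")
```

With a uniform random choice, two neighbouring blocks of the same file can land on the same server, which is what distinguishes this policy from a round-robin one.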

  4. GFS Random Distribution Policy
     [Figure: a large file is split into seven 64 MB blocks, Blk 0 to Blk 6, at offsets 0M, 64M, 128M, 192M, 256M, 320M, and 384M. Server 0 is the writer, and the blocks are spread across Server 0 to Server 3.]

  5. GFS Architecture
     • GFS adopts a master-slave architecture.
     [Figure: the GFS client sends a file name to the master and receives a contact address. It then sends a chunk ID and range to a chunk server and receives chunk data. Each chunk server stores its chunks on a Linux file system.]
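The read path on this slide (file name to the master, contact address back, then chunk ID and range to a chunk server) can be mimicked with a small in-memory model. Everything here is a hypothetical sketch: the class and method names are invented, the client is assumed to also pass a chunk index to the master, the chunk size is shrunk to 8 bytes for readability, and a read is assumed to stay inside one chunk.

```python
class ChunkServer:
    """Holds chunk data keyed by chunk id (stands in for a local file system)."""

    def __init__(self):
        self.chunks = {}

    def read(self, chunk_id, start, length):
        return self.chunks[chunk_id][start:start + length]


class Master:
    """Keeps metadata only: file name -> list of (chunk id, server address)."""

    def __init__(self):
        self.files = {}

    def lookup(self, file_name, chunk_index):
        return self.files[file_name][chunk_index]


class Client:
    def __init__(self, master, servers, chunk_size):
        self.master = master
        self.servers = servers  # address -> ChunkServer
        self.chunk_size = chunk_size

    def read(self, file_name, offset, length):
        # Step 1: ask the master where the chunk lives (metadata only).
        chunk_index = offset // self.chunk_size
        chunk_id, address = self.master.lookup(file_name, chunk_index)
        # Step 2: fetch the byte range directly from that chunk server.
        start = offset % self.chunk_size
        return self.servers[address].read(chunk_id, start, length)


# Build a two-chunk file with a toy chunk size of 8 bytes.
servers = {"cs-a": ChunkServer(), "cs-b": ChunkServer()}
servers["cs-a"].chunks[101] = b"Hadoop M"
servers["cs-b"].chunks[102] = b"apReduce"

master = Master()
master.files["demo.txt"] = [(101, "cs-a"), (102, "cs-b")]

client = Client(master, servers, chunk_size=8)
print(client.read("demo.txt", 0, 6))
print(client.read("demo.txt", 9, 3))
```

The point of the split is that file data never flows through the master: it answers the lookup, and the bytes come straight from a chunk server.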

  6. The Problem Scope
     • Hadoop MapReduce is used for powerful and efficient analytics over Big Data.
     • The power of MapReduce lies in its ability to scale to 100s and even 1000s of machines.
     • What amount of work can MapReduce handle? Big Data on the order of 100s of GBs, TBs, or PBs.
     • It is unlikely that datasets of such sizes can fit on a single machine. Hence, a storage layer like HDFS is required.

  7. Hadoop MapReduce: A Systems View
     • Hadoop MapReduce incorporates two phases, Map and Reduce, which encompass multiple Map and Reduce tasks.
     [Figure: a dataset in HDFS is divided into HDFS blocks that serve as Split 0 to Split 3, each processed by one Map task. Every Map task writes its output as partitions, which the Reduce tasks fetch. The Map phase is followed by the Reduce phase, which consists of a shuffle stage, a merge stage, and a reduce stage. The Reduce tasks write their output to HDFS.]
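The flow on this slide, from splits through partitions to the shuffle, merge, and reduce stages, can be simulated in a single process. The sketch below is an assumption-laden illustration rather than Hadoop's implementation: the function names, the CRC32-based partitioner, the choice of three Reduce tasks, and the made-up "store,amount" records are all hypothetical.

```python
import zlib
from collections import defaultdict

NUM_REDUCERS = 3  # assumption: three Reduce tasks


def partition_for(key, num_reducers):
    """Deterministically map a key to one Reduce task (hash partitioning)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_task(split, map_fn, num_reducers):
    """Map phase: run map_fn on every record and bucket output by partition."""
    partitions = [[] for _ in range(num_reducers)]
    for record in split:
        for key, value in map_fn(record):
            partitions[partition_for(key, num_reducers)].append((key, value))
    return partitions


def reduce_task(reducer_id, all_map_outputs, reduce_fn):
    """Reduce phase for one task: shuffle, then merge, then reduce."""
    # Shuffle stage: fetch this task's partition from every Map task.
    fetched = []
    for partitions in all_map_outputs:
        fetched.extend(partitions[reducer_id])
    # Merge stage: sort the pairs so values with the same key sit together.
    grouped = defaultdict(list)
    for key, value in sorted(fetched):
        grouped[key].append(value)
    # Reduce stage: apply the user's function once per key.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def parse_sale(record):
    """Hypothetical Map function: 'store,amount' -> (store, amount)."""
    store, amount = record.split(",")
    yield store, int(amount)


def total(_store, amounts):
    """Hypothetical Reduce function: sum all amounts for one store."""
    return sum(amounts)


# Four splits, one per Map task; the data is invented for illustration.
splits = [["north,10", "south,5"], ["north,7", "east,3"], ["south,1"], ["east,4", "north,2"]]
map_outputs = [map_task(split, parse_sale, NUM_REDUCERS) for split in splits]

totals = {}
for reducer_id in range(NUM_REDUCERS):
    totals.update(reduce_task(reducer_id, map_outputs, total))
print(totals)
```

Because the partitioner depends only on the key, every pair for a given key reaches the same Reduce task no matter which Map task produced it.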

  8. Data Structure: Keys and Values
     • The MapReduce programmer has to specify only two sequential functions, the Map and the Reduce functions. These functions are translated automatically into multiple Map and Reduce tasks.
     • In MapReduce, data elements are always structured as key-value, i.e., (K, V), pairs. In particular, the Map and Reduce functions receive and emit (K, V) pairs.
     [Figure: input splits feed the Map function, which produces intermediate outputs; these feed the Reduce function, which produces final outputs. The data at every stage is a set of (K, V) pairs.]
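One way to make the "(K, V) pairs in, (K, V) pairs out" contract concrete is to write it down as type aliases. The aliases, function names, and the word-length example below are hypothetical illustrations and not Hadoop's API. The tiny `run_sequential` driver is a stand-in for the framework that would normally turn the two functions into many tasks.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")

# Map: one input pair in, zero or more intermediate pairs out.
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# Reduce: one intermediate key with all its values in, final pairs out.
ReduceFn = Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]]


def length_map(offset: int, line: str) -> Iterable[Tuple[int, str]]:
    """Emit (word length, word) for every word in the line."""
    for word in line.split():
        yield len(word), word


def count_reduce(length: int, words: List[str]) -> Iterable[Tuple[int, int]]:
    """Emit (word length, number of words with that length)."""
    yield length, len(words)


def run_sequential(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework (assumption, not Hadoop)."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in map_fn(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        output.extend(reduce_fn(k2, groups[k2]))
    return output


histogram = run_sequential([(0, "CMUQ is a member of QF")], length_map, count_reduce)
print(histogram)
```

Both user functions are ordinary sequential code with no knowledge of machines or tasks, which is the property the slide highlights.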

  9. WordCount: An Application View
     [Figure: a text file containing "Mohammad is delivering a lecture at CMUQ" and "CMUQ is a member of QF" is stored as two chunks, one per sentence. Each chunk is passed to a Map function as (Key1, Value1) pairs whose values are pieces of the chunk's text. The Map function parses and counts, emitting a (Key2, Value2) pair of the form (word, 1) for every word. A Reduce function iterates over the pairs and sums the counts for each word.]
     • Map output, first chunk: (Mohammad, 1), (is, 1), (delivering, 1), (a, 1), (lecture, 1), (at, 1), (CMUQ, 1)
     • Map output, second chunk: (CMUQ, 1), (is, 1), (a, 1), (member, 1), (of, 1), (QF, 1)
     • Reduce output: (Mohammad, 1), (is, 2), (delivering, 1), (a, 2), (lecture, 1), (at, 1), (CMUQ, 2), (member, 1), (of, 1), (QF, 1)
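The WordCount example on this slide translates almost directly into code. The following is a minimal single-process sketch using the slide's two sentences as chunks. The function names are hypothetical, and the grouping dictionary stands in for the shuffle that a real Hadoop job would perform across machines.

```python
from collections import defaultdict


def map_word_count(chunk):
    """Parse & Count: emit (word, 1) for every word in the chunk."""
    return [(word, 1) for word in chunk.split()]


def reduce_word_count(word, counts):
    """Iterate & Sum: add up all the 1s emitted for one word."""
    total = 0
    for count in counts:
        total += count
    return word, total


# The two chunks of the text file shown on the slide.
chunks = [
    "Mohammad is delivering a lecture at CMUQ",
    "CMUQ is a member of QF",
]

# Group intermediate pairs by word (stand-in for the shuffle).
intermediate = defaultdict(list)
for chunk in chunks:
    for word, one in map_word_count(chunk):
        intermediate[word].append(one)

word_counts = dict(reduce_word_count(word, counts) for word, counts in intermediate.items())
for word, count in word_counts.items():
    print(word, count)
```

The words that appear in both chunks ("is", "a", and "CMUQ") end up with a count of 2, matching the Reduce output listed on the slide.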
