MapReduce Programming Model and Implementations in Cloud Computing

The computing world is transitioning from traditional decentralized distributed systems to a centralized cloud architecture, with MapReduce as a key enabling system. Learn about the paradigm shift, benefits, and practical applications of the MapReduce programming model, and explore how Google has successfully implemented this data-intensive paradigm, enabling efficient data processing and storage in the cloud.

  • Map Reduce
  • Cloud Computing
  • Data Processing
  • Google
  • Programming Model


Presentation Transcript


  1. CHAPTER 14: MapReduce Programming Model and Implementations

  2. Introduction
     Recently, the computing world has been undergoing a transformation from the traditional decentralized distributed system architecture to a centralized cloud computing architecture. Computations and data are operated on in the cloud, i.e., in data centers owned and maintained by a third party. Cloud computing has been motivated by many factors:
     - The low cost of system hardware
     - The increase in computing power and storage capacity
     - The massive growth in data size (web authoring, scientific instruments, etc.)
     Still, the main challenge in the cloud is how to effectively store, query, analyze, and utilize these datasets.

  3. Traditional "Data to Computing" Versus "Computing to Data" Paradigm
     Traditional computing (data to computing):
     - Data stored in a separate repository
     - Data brought into the system for computation (time consuming and limits interactivity)
     - Programs described at a very low level, specifying detailed control of processing and communications
     - Relies on a small number of software packages
     Data-intensive computing (computing to data):
     - The system maintains the data; computation is co-located with storage (fast access)
     - The infrastructure collects data and keeps it co-located with the programs
     - Applications written in terms of high-level operations on data
     - The runtime system controls scheduling, load balancing, etc.

  4. Introduction ...
     Google has successfully implemented and practiced the new data-intensive paradigm in its MapReduce system. The MapReduce system runs on top of the Google File System (GFS), within which data are loaded, partitioned into chunks, and each chunk is replicated. Data processing is co-located with data storage: when a file needs to be processed, the job scheduler consults a storage metadata service to get the host node for each chunk and then schedules a map process on that node, so that data locality is exploited efficiently (a minimal scheduling sketch follows this slide). MapReduce has been widely applied in various fields: data- and compute-intensive applications, machine learning, graphics processing, etc.
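A minimal Python sketch of that locality-aware scheduling decision; the function and parameter names (schedule_map_task, chunk_locations, idle_workers) are hypothetical illustrations, not part of any Google API:

```python
def schedule_map_task(chunk_id, chunk_locations, idle_workers):
    """Pick a worker node for the map task on chunk_id.

    chunk_locations: dict mapping chunk id -> set of nodes holding a replica
                     (what the scheduler learns from the metadata service)
    idle_workers:    list of nodes currently free to run a map task
    """
    # Prefer an idle worker that already stores a replica of the chunk,
    # so the map task reads its input from local disk.
    for worker in idle_workers:
        if worker in chunk_locations[chunk_id]:
            return worker
    # Otherwise fall back to any idle worker (remote read).
    return idle_workers[0] if idle_workers else None
```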

  5. 14.2 MapReduce Programming Model
     MapReduce is a software framework for solving many large-scale computing problems. It is built around map and reduce functions, which are commonly used in functional languages such as Lisp, and it allows users to easily express their computation as map and reduce functions.
     The map function, written by the user, processes a key/value pair to generate a set of intermediate key/value pairs:
     map(key1, value1) -> list(key2, value2)
     The reduce function, also written by the user, merges all intermediate values associated with the same intermediate key:
     reduce(key2, list(value2)) -> list(value2)
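The same contract expressed as Python type signatures (illustrative only; map_fn and reduce_fn are placeholder names, not any framework's API):

```python
from typing import Iterable, Iterator, Tuple

K1, V1, K2, V2 = str, str, str, int  # example concrete types

# map: (key1, value1) -> list(key2, value2)
def map_fn(key1: K1, value1: V1) -> Iterator[Tuple[K2, V2]]:
    raise NotImplementedError  # supplied by the user

# reduce: (key2, list(value2)) -> list(value2)
def reduce_fn(key2: K2, values2: Iterable[V2]) -> Iterator[V2]:
    raise NotImplementedError  # supplied by the user
```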

  6. 14.2.1 WORDCOUNT EXAMPLE
     The wordcount application counts the number of occurrences of each word in a large collection of documents. Steps (see the sketch after this slide):
     - The input is read and broken up into key/value pairs
     - The pairs are partitioned into groups for processing, and they are sorted according to their key as they arrive for reduction
     - Finally, the key/value pairs are reduced, once for each unique key in the sorted list, to produce a combined result
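A single-process Python sketch of these three steps (illustrative only; a real MapReduce framework distributes them across many machines):

```python
from collections import defaultdict
from typing import Iterator, Tuple

def map_fn(doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield (word.lower(), 1)

def reduce_fn(word: str, counts: list) -> int:
    # Merge all intermediate values associated with the same key.
    return sum(counts)

def wordcount(documents: dict) -> dict:
    # Step 1: read the input and break it into intermediate key/value pairs.
    intermediate = []
    for name, text in documents.items():
        intermediate.extend(map_fn(name, text))
    # Step 2: group the pairs by key, sorted as they arrive for reduction.
    groups = defaultdict(list)
    for word, count in sorted(intermediate):
        groups[word].append(count)
    # Step 3: reduce once per unique key to produce the combined result.
    return {word: reduce_fn(word, counts) for word, counts in groups.items()}

print(wordcount({"d1": "the quick brown fox", "d2": "the lazy dog the end"}))
```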

  7. 14.2.2 MAIN FEATURES OF MAPREDUCE
     Data awareness: When the MapReduce master node is scheduling a map task for a newly submitted job, the data location information is retrieved from the GFS master node.
     Simplicity: The framework hides the details of parallelism and concurrency control, allowing programmers to easily design parallel and distributed applications.
     Manageability: Traditional data-intensive applications need two levels of management: managing the input data, moving these data, and preparing them for execution, and then managing the output data. In the Google MapReduce model, data and computation are co-allocated, and with the advantage of GFS it is easier to manage the input and output data.

  8. Main Features of MapReduce ...
     Scalability: Increasing the number of nodes in the system increases the performance of jobs with only minor losses.
     Fault tolerance and reliability: With replication in GFS, the system can achieve high reliability by:
     - Rerunning all the tasks when a host node goes offline
     - Rerunning failed tasks on another node
     - Launching backup tasks when tasks are slowing down

  9. 14.2.3 Execution Overview
     The MapReduce library in the user program first splits the input files into M pieces of 16 to 64 MB each. It then starts many copies of the program on a cluster: one is the master, the rest are workers.
     - The master is responsible for scheduling and monitoring
     - When a map task arises, the master assigns the task to an idle worker
     - A worker reads the content of the corresponding input split and passes key/value pairs to the user-defined map function
     - The intermediate key/value pairs produced by the map function are first buffered in memory, periodically written to local disk, and partitioned into R sets by the partitioning function (see the sketch after this slide)
     - The master passes the locations of the stored pairs to a reduce worker, which reads the buffered data from the map workers using remote procedure calls
     - The reduce worker then sorts the intermediate keys so that all occurrences of the same key are grouped together
     - For each key, the worker passes the corresponding intermediate values to the reduce function
     - Finally, the output is available in R output files
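A minimal sketch of a partitioning function in Python; hash(key) mod R is the default described in the original MapReduce paper, and a deterministic hash (here CRC-32) stands in for it so that every map worker routes a given key to the same reduce task:

```python
import zlib

def partition(key: str, R: int) -> int:
    # Assign an intermediate key to one of the R reduce tasks.
    # The hash must be deterministic across processes: every map worker
    # has to send the same key to the same partition.
    return zlib.crc32(key.encode("utf-8")) % R

print(partition("the", 4))  # always the same partition for "the"
```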

  10. 14.2.4 SPOTLIGHT ON GOOGLE'S MAPREDUCE IMPLEMENTATION
      It targets large clusters of Linux PCs connected through Ethernet switches. Tasks are forked using remote procedure calls, and buffering and communication occur by reading and writing files on GFS. The runtime library is written in C++, with interfaces in Python and Java. MapReduce jobs are spread across Google's massive computing clusters: for example, the average MapReduce job in September 2007 ran across approximately 400 machines, and the system delivered approximately 11,000 machine-years of computation in a single month.

  11. 14.3 Major MapReduce Implementations for the Cloud
      14.3.1 Hadoop: A top-level Apache project, used all over the world. It is stated as a collection of various subprojects for reliable, scalable, distributed computing:
      - Hadoop Common
      - Avro
      - HBase
      - HDFS
      - Hive
      - MapReduce
      - Pig
      - ZooKeeper

  12. HADOOP MAPREDUCE OVERVIEW
      Hadoop Common, formerly Hadoop Core, includes file system, RPC, and serialization libraries and provides the basic services for building a cloud computing environment. The two fundamental subprojects are the MapReduce framework and the Hadoop Distributed File System (HDFS). HDFS is designed to run on clusters of commodity machines, and the Hadoop MapReduce framework is highly reliant on this shared file system. It has a master/slave architecture. The master, called the JobTracker, is responsible for:
      - Querying the NameNode
      - Scheduling the tasks on the slaves
      - Monitoring the successes and failures of the tasks
      - The slaves, called TaskTrackers, execute the tasks as directed by the master
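For illustration, the wordcount mapper and reducer might be written as follows for Hadoop Streaming, which lets the two functions be ordinary scripts reading stdin and writing stdout while the framework handles splitting, sorting, and shuffling (a native Hadoop job would use the Java API instead; the file names are illustrative):

```python
#!/usr/bin/env python
# mapper.py -- emits one "word<TAB>1" line per word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop Streaming delivers input sorted by key, so all
# counts for a given word arrive on consecutive lines.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").rsplit("\t", 1)
    if word != current and current is not None:
        print(f"{current}\t{total}")  # flush the finished key
        total = 0
    current = word
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")  # flush the last key
```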

  13. HADOOP COMMUNITIES
      Yahoo! has been the largest contributor to the Hadoop project; it uses Hadoop in its web search and advertising businesses. Besides Yahoo!, many other vendors have introduced and developed their own solutions for the enterprise cloud, e.g., IBM Blue Cloud, Cloudera, the OpenSolaris Hadoop Live CD by Sun Microsystems, and Amazon Elastic MapReduce.

  14. 14.3.2 Disco
      Disco is an open source MapReduce implementation developed by Nokia. The Disco core is written in Erlang, while users write jobs in Python. It is based on a master/slave architecture: when the Disco master receives jobs from clients, it adds them to the job queue and runs them in the cluster when CPUs become available. On each node there is a worker supervisor, responsible for spawning (creating) and monitoring all the running Python worker processes within the node. A Python worker runs the assigned tasks and then sends the addresses of the resulting files to the master through its supervisor. An httpd daemon (web server) runs on each node, which enables a remote Python worker to access files from the local disk of that particular node.
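A wordcount job in Disco's Python API might look roughly like the following (modeled on the example in Disco's documentation; the exact API varies across Disco versions, and the input URL is a placeholder):

```python
from disco.core import Job, result_iterator
from disco.util import kvgroup

def map(line, params):
    # Runs inside a Python worker process on a cluster node.
    for word in line.split():
        yield word, 1

def reduce(iter, params):
    # Group sorted intermediate pairs by key and sum the counts.
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == "__main__":
    job = Job().run(input=["http://example.com/input.txt"],  # placeholder
                    map=map, reduce=reduce)
    for word, count in result_iterator(job.wait()):
        print(word, count)
```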

  15. 14.3.3 MapReduce.NET
      It aims to provide support for a wide variety of data-intensive and compute-intensive applications. MapReduce.NET is designed for the Windows platform, with an emphasis on reusing as many existing Windows components as possible (ref. Fig. 14.6, Architecture of MapReduce.NET). The MapReduce.NET runtime library is assisted by several component services from Aneka and runs on WinDFS. Aneka supports the development and deployment of .NET-based cloud applications in public cloud environments such as Amazon EC2. MapReduce.NET can also work with the Common Internet File System (CIFS) or NTFS.

  16. 14.3.4 Skynet
      Skynet is a Ruby implementation of MapReduce, created by Geni. It is an adaptive, self-upgrading, fault-tolerant, and fully distributed system with no single point of failure. It uses a plug-in based message queue architecture; message queuing allows workers to watch out for each other, so if a worker fails, another worker will notice and pick up its tasks. Currently two message queue implementations are available: one built on Rinda and another built on MySQL.

  17. 14.3.5 GridGain
      GridGain is an open cloud platform, developed in Java, which allows users to develop and run applications on private or public clouds. It defines the process of splitting an initial task into multiple subtasks, executing these subtasks in parallel, and aggregating the results back into one final result. New features of GridGain are:
      - Distributed task sessions
      - Checkpoints for long-running tasks
      - Early and late load balancing
      - Affinity co-location with the data grid

  18. 14.4 MapReduce Impacts and Research Directions
      Different implementations of MapReduce exist:
      - Qt Concurrent is a C++ library for multi-threaded applications
      - Stanford's Phoenix targets shared-memory architectures
      - Mars is a MapReduce framework on graphics processors (GPUs)
      - Hadoop, Disco, Skynet, and GridGain are open source implementations for large-scale data processing
      - Map-Reduce-Merge is an extension of MapReduce that adds a merge phase to easily process data relationships among heterogeneous datasets
      Other efforts focus on enabling MapReduce to support a wider range of applications. At present, many research institutions are working to optimize the performance of MapReduce for the cloud in two directions. The first is driven by the simplicity of the MapReduce scheduler: the authors introduced a new scheduling algorithm called Longest Approximate Time to End (LATE) to improve the performance of Hadoop in heterogeneous environments (a sketch of the LATE estimate follows this slide). The second is driven by the increasing maturity of virtualization technology.
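LATE estimates each running task's time to completion from its progress so far and speculatively re-executes the task expected to finish furthest in the future. A simplified Python sketch of that estimate (per the formulation by Zaharia et al.; the real scheduler adds caps and thresholds omitted here):

```python
def estimated_time_to_end(progress: float, elapsed: float) -> float:
    # progress_rate = progress_score / elapsed_time
    # time_to_end   = (1 - progress_score) / progress_rate
    rate = progress / elapsed
    return (1.0 - progress) / rate

# Speculate on the running task with the longest estimated time to end,
# e.g. among tasks given as (progress_score, elapsed_seconds) pairs:
tasks = {"t1": (0.9, 60.0), "t2": (0.3, 50.0)}
straggler = max(tasks, key=lambda t: estimated_time_to_end(*tasks[t]))
print(straggler)  # t2: its slow progress rate implies the longest time to end
```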
