Introduction to Map-Reduce Programming Model

Slide Note

Map-Reduce is a programming model that offers a novel way to solve complex problems by abstracting the design process. It enables handling huge volumes of data efficiently, making computational resources more accessible. The model simplifies code execution on multiple machines, freeing programmers from dealing with intricate details. Learn about the history of Map-Reduce, its roots in LISP, and the concepts of map and reduce functions. Discover how Map-Reduce simplifies data processing and enhances scalability, making it a valuable tool in modern computing environments.

dunnigan_b Follow

Uploaded on Mar 09, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Intro to Map-Reduce Feb 21, 2014

map-reduce? A programming model or abstraction. A novel way of thinking about designing a solution to certain problems Feb 21, 2014 CS512 | Spring 2014

Why? We have access to huge volumes of data. Facebook posts and photos, twitter streams, Feb 21, 2014 CS512 | Spring 2014

Easy Computational resources are cheap. Amazon EC2, Microsoft Azure, Feb 21, 2014 CS512 | Spring 2014

Not really! Code running on one CPU is simple, two is a headache, four is a nightmare, You get the picture. Feb 21, 2014 CS512 | Spring 2014

Pipe Dream Forget multiple machines. Just write code imagining one CPU. Someone else takes care of running the code on thousands of machine. Free the programmer from the unnecessary details. Feb 21, 2014 CS512 | Spring 2014

Map-Reduce map-reduce programming model to the rescue Feb 21, 2014 CS512 | Spring 2014

Long long ago LISP, 1958 A programming language that introduced several innovative ideas Recursive Functions of Symbolic Expression and Their Computation by Machine, Part I John McCarthy, MIT, April 1960 Feb 21, 2014 CS512 | Spring 2014

LISP Introduced map and reduce. Feb 21, 2014 CS512 | Spring 2014

Map map(mf, [a1, a2, an]) -> [b1, b2, , bn] Accepts two arguments: a function and a list of values. Generates output by repeatedly applying the function on the list of values. Feb 21, 2014 CS512 | Spring 2014

Reduce reduce(rf, [b1, b2, bn]) -> c Accepts two arguments: a function and a list of values. Generates output by reducing the list of input values using the function. Feb 21, 2014 CS512 | Spring 2014

Simple composition Map s output is a list of values, which reduce can accept as one of its argument. Feb 21, 2014 CS512 | Spring 2014

Analogy Break large problem into small pieces Code mf to solve one piece Run map to apply mf on the small pieces and generate nuggets of solutions Code rf to combine the nuggets Run reduce to apply rf on the nuggets to output the complete solution Feb 21, 2014 CS512 | Spring 2014

Example 1TB file split into 100,000 chunks Count number of lines in each chunk Add counts together to output final line count Feb 21, 2014 CS512 | Spring 2014

A slightly different map-reduce Map Copies a function on a number of machines and applies each copy on different pieces of the input Reduce Combine the map outputs from different machines into a final solution Feb 21, 2014 CS512 | Spring 2014

Map-reduce reintroduced Google created the awareness Hadoop made it into a sensation Hadoop is an open-source map-reduce implementation based on Google s paper. MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat OSDI'04: Sixth Symposium on Operating System Design and Implementation. December, 2004. Feb 21, 2014 CS512 | Spring 2014

Hadoop Feb 21, 2014 CS512 | Spring 2014

Example Feb 21, 2014 CS512 | Spring 2014

Terminology Mapper Instance of the map function Reducer Instance of the reduce function Feb 21, 2014 CS512 | Spring 2014

Job User s implementation of map and reduce functions Feb 21, 2014 CS512 | Spring 2014

Splitting the input User submits job and specifies the input files. Input files are split into chunks. Chunks are fed to the mappers typically over a distributed file system like HDFS. Feb 21, 2014 CS512 | Spring 2014

Copying the job JobTracker Hadoop service that copies the map and reduce code to available machines. Feeds the input to the mappers and connects their outputs to reducers. Feb 21, 2014 CS512 | Spring 2014

Maps in parallel Maps run in parallel. Each maps operates on a set of chunks assigned to it by the job tracker. Maps write to local disk. Feb 21, 2014 CS512 | Spring 2014

What if maps fail? Re-run failed maps. No need to re-run succeeded maps. Why does this work? Maps typically are idempotent. Feb 21, 2014 CS512 | Spring 2014

Input to reducers #reducers(n) known a priori. #partitions equals #reducers. Hash on the keys of mapper outputs. partition = hash(key) mod n Load balancing by randomization. Feb 21, 2014 CS512 | Spring 2014

Wait before reducing Cannot start reducers before mappers complete. Synchronization barrier between map and reduce phases. Why? Feb 21, 2014 CS512 | Spring 2014

Embarrassing Parallelism Feb 21, 2014 CS512 | Spring 2014

Not a panacea! If your workload exhibits embarrassing parallelism, Hadoop might be the ideal framework. If not, look for other parallel programming paradigms. Feb 21, 2014 CS512 | Spring 2014

Example Feb 21, 2014 CS512 | Spring 2014

WordCount https://developer.yahoo.com/hadoop/tutorial/module4.html Feb 21, 2014 CS512 | Spring 2014

Mapper public static class MyMapper implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { // Split the given line (in value) into words and emit for each word the tuple <word, 1> String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } } https://developer.yahoo.com/hadoop/tutorial/module4.html Feb 21, 2014 CS512 | Spring 2014

Reducer public static class MyReducer implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } https://developer.yahoo.com/hadoop/tutorial/module4.html Feb 21, 2014 CS512 | Spring 2014

Introduction to Map-Reduce Programming Model

Download Presentation

Presentation Transcript

Related

More Related Content