Understanding MapReduce: Motivation, Abstraction, and Examples

Explore the motivation behind MapReduce, its abstraction goals, the map and reduce functions, examples such as the inverted index and distributed grep, and the role of functional programming in this paradigm. Learn how MapReduce simplifies distributed computing by handling parallelization, communication, data distribution, and fault tolerance.



Presentation Transcript


  1. MapReduce Some materials are from MIT EECS 6.824, Danyang Zhuo (Duke), and Jeff Chase (Duke)

  2. Motivation (2004)
     • Google engineers had implemented hundreds of special-purpose computations that process large amounts of raw data.
     • Google wanted to enable programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system (hundreds of thousands of machines).

  3. Abstraction Goals
     • Express simple computations
     • Hide complexities:
       - Parallelization
       - Communication
       - Data distribution and load balancing
       - Fault tolerance

  4. Map and Reduce Example
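  The slide's running example is word count (confirmed by slide 6's key/value mapping): map emits a <word, 1> pair per occurrence, and reduce sums the counts for each word. A minimal Python sketch of the two user-supplied functions (the names map_fn and reduce_fn are illustrative; the paper's actual interface is C++):

    def map_fn(key, value):
        """key: document name (unused); value: document contents."""
        for word in value.split():
            yield (word, 1)        # one intermediate pair per occurrence

    def reduce_fn(key, values):
        """key: a word; values: all counts emitted for that word."""
        yield sum(values)          # total occurrences of the word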

  5. Why is this interface good? The program does not say:
     • how to parallelize the computation
     • how to facilitate communication between different machines
     • how the input is partitioned or how the output should be partitioned
     • how failure should be handled

  6. Map and Reduce functions
     • map: (key1, value1) → list(key2, value2)
     • reduce: (key2, list(value2)) → list(value3)
     • key2 is called an intermediate key
     • In our example:
       - key1 is the document name (not used), value1 is the document contents
       - key2 is a word, value2 is a count (always 1)
       - value3 is a count (the output list always has length 1)

  7. Examples
     • Inverted index: find all the webpages that contain each keyword. Map outputs list(<keyword, URL>). (Sketched below.)
     • Reverse web-link graph: for a webpage, find all the webpages pointing to it. Map outputs list(<target URL, source URL>).
     • Distributed grep: pattern matching across a large set of files. Map outputs lines matching the pattern; Reduce is the identity function.
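  As a concrete illustration of the first bullet, a sketch of the inverted index in the same style (function names are illustrative):

    def map_fn(url, page_text):
        """Emit <keyword, URL> once per distinct word on the page."""
        for word in set(page_text.split()):
            yield (word, url)

    def reduce_fn(word, urls):
        """Collect every page containing the word."""
        yield sorted(urls)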

  8. Functional programming
     • "Functional programming is a programming paradigm where programs are constructed by applying and composing functions. It is a declarative programming paradigm in which function definitions are trees of expressions that map values to other values, rather than a sequence of imperative statements which update the running state of the program." [Wikipedia]
     • No side effects
     • No mutable state
     • Rationale for MapReduce: much easier to deal with failures!
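  The failure-handling payoff is that a pure function can be re-executed anywhere with the same result. A contrived Python contrast (both functions are illustrative):

    results = []                       # shared mutable state

    def impure_map(key, value):
        results.append((value, 1))     # side effect: re-running after a
                                       # crash would double-count

    def pure_map(key, value):
        return [(value, 1)]            # output depends only on arguments,
                                       # so a rerun on another worker is safe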

  9. Any backend should work!
     • A small shared-memory machine
     • A large NUMA multi-processor
     • A large collection of networked machines (this paper's implementation)

  10. Main contribution of the paper
     • An interface design for automated parallelization
     • An implementation of the interface that achieves high performance on large clusters
     • See also the open-source Apache Hadoop MapReduce and Hadoop Distributed File System (HDFS)

  11. MapReduce

  12. Overall flow
     1. A MapReduce library in the user program splits the input files into M pieces, typically 16 to 64 MB per piece. It then starts many copies of the program on a cluster of machines.
     2. One special copy is the master; the rest are workers. There are M map tasks and R reduce tasks. The master assigns tasks to workers.
     3. A worker assigned a map task reads a piece of input and passes each key-value pair to the user-defined Map function. Results are buffered in memory.

  13. Overall flow
     4. Buffered pairs are written to local disks, partitioned into R regions. The locations of these files are passed back to the master, which forwards them to the workers running reduce tasks.
     5. When a reduce worker receives the locations of the intermediate files, it uses RPCs to read those files. When all the intermediate data is ready, it sorts the data and groups it by intermediate key.
     6. For each unique intermediate key, the reduce worker runs the user-defined Reduce function. The output is appended to the final output.
     7. After all map and reduce tasks are completed, the master wakes up the user program.
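  To make steps 1-7 concrete, here is a single-process Python mock of the flow: it maps over M input splits, partitions intermediate pairs into R regions, sorts within each region, and reduces. Purely illustrative; there is no master, no RPC, and no disk.

    from collections import defaultdict

    def run_mapreduce(splits, map_fn, reduce_fn, R):
        """splits: list of (key, value) input pieces, so M = len(splits)."""
        # Map phase: buffer pairs, partitioned into R regions (steps 3-4).
        regions = [defaultdict(list) for _ in range(R)]
        for key, value in splits:
            for k2, v2 in map_fn(key, value):
                regions[hash(k2) % R][k2].append(v2)

        # Reduce phase: sort and group by intermediate key (steps 5-6).
        output = []
        for region in regions:
            for k2 in sorted(region):
                for v3 in reduce_fn(k2, region[k2]):
                    output.append((k2, v3))
        return output

    # e.g. run_mapreduce([("doc", "a b a")], map_fn, reduce_fn, R=2)
    # produces ("a", 2) and ("b", 1), in partition order.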

  14. What does the master do?
     • Assign map and reduce tasks to workers
     • Track the progress of tasks
     • Forward locations of intermediate files

  15. Fault tolerance
     • Worker failure:
       - The master pings every worker periodically
       - Map/reduce tasks on failed workers are rerun on other workers
       - Sometimes a rerun is not needed, if the output has already been consumed
       - When a map task reruns, all the reduce workers must be notified of the new intermediate file locations
     • Master failure: abort the job. Why is this fine?

  16. What happens to the output if tasks can rerun?
     • Atomic commits: when a task generates output, it is first put into a temp file. When the task completes, it renames the file and sends a message to the master.
     • If tasks are deterministic (strong semantics): the output is the same as the output of a non-faulty sequential execution of the map and reduce tasks.
     • If tasks are not deterministic (weaker semantics): each reducer's output is the same as its output in some non-faulty sequential execution of the non-deterministic map and reduce tasks.
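  The commit relies on the file system's atomic rename. A sketch of the idea (paths and names are illustrative):

    import os

    def commit_task_output(task_id, data, final_path):
        """Write to a private temp file, then atomically rename it.
        Each rerun of the task writes its own temp file, and the master
        accepts only one completion, so exactly one final file is ever
        visible."""
        tmp_path = f"{final_path}.tmp.{task_id}"     # illustrative naming
        with open(tmp_path, "w") as f:
            f.write(data)
        os.rename(tmp_path, final_path)              # atomic on POSIX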

  17. Locality
     • Files are stored in a distributed file system (the Google File System, which will be covered later in the course)
     • Each file has three copies in the cluster
     • When scheduling a map task, the master tries to choose a worker that is co-located with a copy of the input data, or a worker that is close to a copy
     • When running a large MapReduce operation, most input data is read locally and consumes zero network bandwidth

  18. What's the granularity of tasks?
     • There are O(M + R) scheduling decisions
     • There are O(M × R) pieces of state
     • In practice, people want a small R, because users do not want the output to be separated into many pieces (M = 200,000, R = 5,000 on 2,000 worker machines)
     • What's the problem with large granularity? With small granularity?

  19. Straggler mitigation
     • A machine can be slow for many reasons:
       - Faulty disk
       - Faulty configuration
       - Other tasks running on the machine
     • MapReduce's solution is to schedule backup executions of the remaining tasks. This increases the total workload (similar to primary-backup replication), so usage must be limited to only a few percent of tasks.

  20. How to split data?
     • Mappers execute in parallel; reducers execute in parallel
     • For mappers, split the input files into M pieces
     • For reducers, use a partitioning function, e.g., hash(key) mod R (sketched below)
     • Users can also provide a partitioning function, e.g., to group related keys into the same output file
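  A sketch of both partitioners: the default is the hash-mod form from the slide, and the hostname variant is the paper's example of a user-supplied partitioning function.

    from urllib.parse import urlparse

    R = 5  # number of reduce tasks

    def default_partition(key):
        return hash(key) % R                  # hash(key) mod R

    def hostname_partition(url_key):
        """Keys that are URLs from the same host land in the same
        output file (the paper's custom-partitioner example)."""
        return hash(urlparse(url_key).hostname) % R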

  21. Other details
     • Map workers may run a user-defined combine function to reduce the size of intermediate files, e.g., summing word counts on the map worker before the reduce phase (sketched below)
     • Side effects must be atomic and idempotent; otherwise, the semantics break when tasks fail and rerun
     • Skipping bad records: record processing may rely on third-party libraries; rerun the task and skip the bad records
     • Visualizing progress is important for developers
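  A sketch of the word-count combiner from the first bullet: it has the same shape as reduce but runs on the map worker, shrinking the data before it crosses the network.

    from collections import Counter

    def combine(pairs):
        """Sum counts locally, e.g. [("the", 1), ("the", 1), ("a", 1)]
        becomes [("the", 2), ("a", 1)] before being written out."""
        counts = Counter()
        for word, n in pairs:
            counts[word] += n
        return list(counts.items())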

  22. Why not store intermediate results in GFS?
     • In a distributed file system, files are replicated for fault tolerance. Storing the intermediate results there would require additional storage space and be slower.
     • Intermediate files are deleted anyway.

  23. What if multiple reducers want to output to the same output file? Atomic rename makes sure that only one output file is finally generated

  24. Why is sorting needed for intermediate files?
     • Ultimately, what you want is a kind of "groupby". There are two common implementations of groupby; you can learn more by taking a database course. (Both are contrasted in the sketch below.)
       - Hash-based groupby
       - Sort-based groupby
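  A minimal Python contrast of the two strategies; MapReduce uses the sort-based one, since after sorting, equal keys are contiguous and can be consumed as a stream.

    from collections import defaultdict
    from itertools import groupby

    pairs = [("b", 1), ("a", 1), ("b", 1)]

    # Hash-based groupby: one pass, groups held in a hash table.
    table = defaultdict(list)
    for k, v in pairs:
        table[k].append(v)

    # Sort-based groupby: sort by key, then stream runs of equal keys.
    for k, run in groupby(sorted(pairs), key=lambda kv: kv[0]):
        values = [v for _, v in run]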

  25. Limitations
     • In 2004, Google tried to optimize for network usage
       - Technique: locality, by scheduling a map task on a machine that stores the input data
       - Technique: a combine function run before reduce
     • The design exploits embarrassing parallelism
       - No communication during the map phase
       - No communication during the reduce phase
       - Data is exchanged only in between
     • Ethernet speed then: 100-200 Gbps for 1,800 machines; today's Ethernet speed: > 100 Gbps for a single machine
     • Why should intermediate data be stored on disks? (Spark addresses this)
     • Iterative, stateful computation on matrices/tensors (TensorFlow addresses this)
