Map Reduce

What is it?

A programming paradigm for processing large quantities of information in parallel.

It is divided into two stages:

  1. "Map" is responsible for requesting, sorting, and processing the information, with each piece handled independently and in isolation. The result is a set of ordered key-value pairs.

  2. "Reduce" is responsible for combining and grouping the information from the previous stage into a much smaller set of data (in the limit, a single value).

Algorithm

Chunks of information are processed by Mappers in isolation and, because of that, the Mappers can be distributed across a network.

The result of the Mappers' work is an intermediate product that is supplied to the Reducers in a process called shuffling, which groups the key-value pairs by key.

The result of the Reducers is the final result.
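The three steps above (map, shuffle, reduce) can be sketched on a single machine. This is only a minimal illustration, not a distributed implementation: the names map_reduce, mapper, and reducer are hypothetical, and the parity example exists purely to show the data flow.

```python
from collections import defaultdict


def map_reduce(chunks, mapper, reducer):
    """Run a map / shuffle / reduce pipeline in a single process."""
    # Map: each chunk is processed on its own and yields (key, value) pairs.
    intermediate = [pair for chunk in chunks for pair in mapper(chunk)]

    # Shuffle: group the intermediate values by key.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce: combine the values of each key into one result.
    return {key: reducer(key, values) for key, values in grouped.items()}


def parity_mapper(numbers):
    # Emit one ("even" | "odd", number) pair per number in the chunk.
    for n in numbers:
        yield ("even" if n % 2 == 0 else "odd", n)


def sum_reducer(key, values):
    # The key is unused here; a reducer may need it in other problems.
    return sum(values)


result = map_reduce([[1, 2, 3], [4, 5, 6]], parity_mapper, sum_reducer)
print(result)  # {'odd': 9, 'even': 12}
```

In a real deployment the Mappers and Reducers would run on different nodes and the shuffle would move data over the network; here all three steps are ordinary function calls.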

Example

Histogram of the words in the Lusiadas.

  • Begin by creating chunks - to simplify, each verse is a chunk.

  • Map - for each verse, a list of key-value pairs is created, holding each word and the number of times it appears in that verse.

  • Reduce - the counts from the previous step are aggregated by summing them per word.

  • In the end, the histogram is the result.
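The word histogram can be sketched in a few lines. The two verses below are made-up placeholders, not quotations from the poem, and the function names map_verse and reduce_counts are hypothetical. Counter from the standard library is used here as a convenient way to hold the key-value pairs.

```python
from collections import Counter
from functools import reduce

# Placeholder verses (assumption: stand-ins, not text from the poem).
verses = [
    "o mar e o vento",
    "o vento e a vela",
]


def map_verse(verse: str) -> Counter:
    # Map: count each word inside one verse, independently of the others.
    return Counter(verse.lower().split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    # Reduce: adding two Counters sums the counts word by word.
    return left + right


partial_counts = [map_verse(verse) for verse in verses]
histogram = reduce(reduce_counts, partial_counts, Counter())

print(histogram["o"], histogram["vento"], histogram["mar"])  # 3 2 1
```

Because reduce_counts is associative, the partial counts could be combined in any grouping, which is what lets several Reducers share the work.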

Advantages

Parallel processing

Each task is completely independent, so tasks can run at the same time. Dividing the problem also simplifies it.
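Task independence can be illustrated with a worker pool. This is a sketch under assumptions: count_words and the sample chunks are hypothetical, and a thread pool is used only because it keeps the example short. A real system would spread the tasks over processes or machines.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(chunk: str) -> int:
    # Each call only looks at its own chunk, so calls can run concurrently.
    return len(chunk.split())


chunks = ["one two three", "four five", "six"]

# pool.map hands each chunk to a worker and returns results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_results = list(pool.map(count_words, chunks))

total = sum(partial_results)
print(partial_results, total)  # [3, 2, 1] 6
```

No task reads or writes anything shared with another task, so no locking is needed.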

Information placement

The data is not centralized, but distributed across all the computation nodes.

Instead of transmitting data between nodes, the Map and Reduce code migrates to where the data is located.
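The idea of moving the computation to the data can be simulated in one process. The Node class and its run method are hypothetical and stand in for real machines: each node keeps its own chunks, receives a function, and sends back only the small results.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    """Simulated computation node that owns part of the data (assumption)."""
    name: str
    local_chunks: List[str]

    def run(self, task: Callable[[str], int]) -> List[int]:
        # The task travels to the node; the chunks never leave it.
        return [task(chunk) for chunk in self.local_chunks]


def count_words(chunk: str) -> int:
    return len(chunk.split())


nodes = [
    Node("node-a", ["a b", "c"]),
    Node("node-b", ["d e f"]),
]

# Only the per-chunk counts cross the (simulated) network.
partial_results = [count for node in nodes for count in node.run(count_words)]
total = sum(partial_results)
print(partial_results, total)  # [2, 1, 3] 6
```

The function is usually far smaller than the data it processes, which is why sending it to each node tends to be cheaper than collecting the data in one place.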
