septembre, 16, 2009

A friend recently asked me to explain her the concept behind MapReduce. The last time we saw each other, I had no time to do that, so I now use my  blog for this purpose.

Map and reduce are particular functions that can be found we in numerous functional languages (such as python, ocaml, etc.).

Map is a function that takes as input an other function, let’s say f and a data structure (most of the time a sequence), that calls the function f on each of the structure’s items and returns a list of the return values. Here is an example: suppose that I have a sequence S=[1,5,4,2,6,7] and that I want to compute the square of each of the integers in this sequence using the shortest line of code possible. I use map for this purpose and call map(x^2,S), map then applies x^2 on each item of the sequence S and outputs [1,25,16,4,36,49].

Reduce is a function that also takes as input an other function, let’s say f and a data structure (most of the time a sequence), but that calls the function f not on each item of the structure, but on the whole structure. For instance map(+,S) outputs the sum of the integers in S, e.g. 25.

I can now explain what is the concept of the famous Google’s MapReduce.

This is a computation framework for large distributable data sets. Suppose you have a very large data set (let’s say an entire crawl of the web), and that you want to apply on it a very quickly a time consuming algorithm using a lot of computers. The first step is to take the input (in our case this huge set of web pages) and to cut it into small problems that can be solved independently by your large set of computers. Each of this computer has the goal to solve one of these small problems, then to send the result back to one specific computer (the master). This first step is the map step of MapReduce. At this point the results given by the computers are all gathered (recombined) in order to obtain the result of the original problem. This is the reduce step.

This approach is nice: computation are done in a distributed way, so the time spent for the computation is very short. The computation is fault-tolerant (if a computer crashes, its sub problem is given to an other one). But unfortunately MapReduce can only be applied to very specific problems (where there are no communication between computers during the computation).

To finish a crappy whiteboard+iphone picture to illustrate this concept:


The goal is of course to « colorize quickly the picture »…

3 Responses to “MapReduce?”

Yeah, well MapReduce is a simple instance of Algorithmic Skeletons as developped by Murray Cole and al. in early 90’s. Google just marketed that with fancy name.

joel_f, 20 septembre 2009 à 18:16

Joel, I strongly agree with you 😉

Sylvain, 20 septembre 2009 à 18:31

of course 😮 Fun fact, I met some guys from Google at PACT 2009 and they were like « no we didn’t know about 😮 »

joel_f, 21 septembre 2009 à 21:44
Picture: courtesy of Abby Blank