Big Data and Cloud Tips: Map and Reduce in Python without Hadoop

Wednesday, January 15, 2014

Map and Reduce in Python without Hadoop

MapReduce is not a new programming model, but the Google's paper on MapReduce made it popular. A map is usually used for transformation, while reduce/fold is used for aggregation. They are built-in primitives used in functional programming languages like Lisp and ML. More about the functional programming roots to MapReduce paradigm can be found in Section 2.1 of Data-Intensive Text Processing with MapReduce paper.

Below is a simple Python 2 program using the map/reduce functions. map/reduce are functions in the __builtin__ python module. More about functional programming in Python here. For those using Python3, the reduce function has removed from the __builtin__ package. According to the Python 3.1 release notes :

Removed reduce(). Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable.

The Python 2 program `squares/transforms` a list of 1 to 100 using `map/square` and then `sums/aggregates` them up using the `reduce/add` function. Note that Hadoop which provides a run time environment for executing MapReduce programs also does something similar, but in a distributed fashion to process huge amounts of data.

#!/usr/bin/python 

def square(x):
    return x * x

def add(x, y):
    return x + y

def main():
    print reduce(add, map(square, range(100)))

if __name__ == "__main__":
    main()