Monday, September 10, 2012

Why does Hadoop use key/value (KV) pairs?

Hadoop implements the MapReduce paradigm, as described in the original Google MapReduce paper, in terms of key/value pairs. The output types of the map function must match the input types of the reduce function, as shown below.

(K1, V1) -> Map -> (K2, V2)
(K2, V2) -> Reduce -> (K3, V3)
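
To see what those signatures look like in practice, here is a minimal word-count-style sketch against the org.apache.hadoop.mapreduce API. The class names are my own, and (K1, V1) = (LongWritable, Text) simply reflects the default text input format (byte offset of a line, the line itself).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// (K1,V1) = (LongWritable, Text): byte offset of the line, line contents.
// (K2,V2) = (Text, IntWritable): a word, a partial count.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE); // emit (K2,V2)
        }
    }
}

// (K2,V2) in, (K3,V3) = (Text, IntWritable) out: a word and its total count.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum)); // emit (K3,V3)
    }
}

Note how the reducer's input types (Text, IntWritable) are exactly the mapper's output types, which is the contract the signatures above express.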

The big question is: why use key/value pairs at all?

MapReduce is derived from concepts in functional programming. Here is a tutorial on the map and fold primitives in functional programming that are used in the MapReduce paradigm, and there is no mention of key/value pairs anywhere; a quick sketch of both primitives follows below.
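
As a toy illustration of those two primitives on a plain list (Java 8 streams, nothing Hadoop-specific, data made up for the example), notice that neither map nor fold says anything about keys:

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapFoldDemo {
    public static void main(String[] args) {
        List<Integer> xs = Arrays.asList(1, 2, 3, 4);

        // map: apply a function to every element independently.
        List<Integer> squares = xs.stream()
                .map(x -> x * x)
                .collect(Collectors.toList());

        // fold (reduce): combine all elements into a single result.
        int sum = xs.stream().reduce(0, (acc, x) -> acc + x);

        System.out.println(squares); // [1, 4, 9, 16]
        System.out.println(sum);     // 10
    }
}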

According to the Google MapReduce paper:

We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately.
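
To make the quoted idea concrete, here is a toy simulation (plain Java, no Hadoop, sample data invented for illustration) of the grouping step between map and reduce. The key is precisely what tells the framework which values belong together before reduce runs.

import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleDemo {
    public static void main(String[] args) {
        // Intermediate (K2,V2) pairs as a mapper might emit them.
        List<SimpleEntry<String, Integer>> intermediate = Arrays.asList(
                new SimpleEntry<>("the", 1),
                new SimpleEntry<>("cat", 1),
                new SimpleEntry<>("the", 1));

        // The shuffle phase groups values by key before reduce runs.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (SimpleEntry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }

        // Reduce sees ("cat", [1]) and ("the", [1, 1]) and combines each list.
        grouped.forEach((word, counts) -> System.out.println(
                word + " -> " + counts.stream().mapToInt(Integer::intValue).sum()));
        // Prints: cat -> 1, the -> 2
    }
}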

The only reason I can see for Hadoop using key/value pairs is that the Google paper used them to meet Google's requirements, and the same design was carried over into Hadoop. And now everyone is trying to fit their problem space into key/value pairs.

Here is an interesting article on using tuples in MapReduce.
