Friday, December 16, 2011

What should be the input/ouput key/value types be for the combiner?

When only a map and reducer class are defined for a job, the key/value pairs emitted by the mapper are consumed by the by the reducer. So, the output types for the mapper should be the same as the reducer.

(input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (output)

When a combiner class is defined for a job, the intermediate key value pairs are combined on the same node as the map task before sending to the reducer. Combiner reduces the network traffic between the mappers and the reducers.

Note that the combiner functionality is same as the reducer (to combine keys), but the combiner input/output key/value types should be of the same type, while for the reducer this is not a requirement.

(input) <k1, v1> -> map -> <k2, v2> -> combine* -> <k2, v2> -> reduce -> <k3, v3> (output)

In the scenario where the reducer class is also defined as a combiner class, the combiner/reducer input/ouput key/value types should be of the same type (k2/v2) as below. If not, due to type erasure the program compiles properly but gives a run time error.

(input) <k1, v1> -> map -> <k2, v2> -> combine* -> <k2, v2> -> reduce -> <k2, v2> (output)

1 comment:

  1. This is one of the most incredible blogs on hadoop Ive read in a very long time. The amount of information in here is stunning, like you practically wrote the book on the subject. Your blog is great for anyone who wants to understand this subject more. Great stuff; please keep it up!
    Hadoop Training in hyderabad