Big Data and Cloud Tips: What should be the input/ouput key/value types be for the combiner?

Friday, December 16, 2011

What should be the input/ouput key/value types be for the combiner?

When only a map and reducer class are defined for a job, the key/value pairs emitted by the mapper are consumed by the by the reducer. So, the output types for the mapper should be the same as the reducer.

(input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (output)

When a combiner class is defined for a job, the intermediate key value pairs are combined on the same node as the map task before sending to the reducer. Combiner reduces the network traffic between the mappers and the reducers.

Note that the combiner functionality is same as the reducer (to combine keys), but the combiner input/output key/value types should be of the same type, while for the reducer this is not a requirement.

(input) <k1, v1> -> map -> <k2, v2> -> combine* -> <k2, v2> -> reduce -> <k3, v3> (output)

In the scenario where the reducer class is also defined as a combiner class, the combiner/reducer input/ouput key/value types should be of the same type (k2/v2) as below. If not, due to type erasure the program compiles properly but gives a run time error.

(input) <k1, v1> -> map -> <k2, v2> -> combine* -> <k2, v2> -> reduce -> <k2, v2> (output)

Big Data and Cloud Tips

Pages

Friday, December 16, 2011

What should be the input/ouput key/value types be for the combiner?

No comments:

Post a Comment