Friday, November 4, 2011

Hadoop MapReduce challenges in the Enterprise

Platform Computing published a five part series (one, two, three, four, five) about the Hadoop MapReduce Challenges in the Enterprise. Some of the challenges mentioned in the Series are addressed by the NextGen MapReduce which will be available soon for download, but some of the allegations are not proper. Platform has got products around MapReduce and is about to be acquired by IBM, so not sure how they got them wrong.

Platform) On the performance measure, to be most useful in a robust enterprise environment a MapReduce job should take  sub-millisecond to start,  but the job startup time in the current open source MapReduce implementation is measured in seconds.

Praveen) MapReduce is supposed to be for batch processing and not for online transactions. The data from a MapReduce Job can be fed to a system for online processing. It's not to say that there is no scope for improvement in the MapReduce job performance.

Platform) The current Hadoop MapReduce implementation does not provide such capabilities. As a result, for each MapReduce job, a customer has to assign a dedicated cluster to run that particular application, one at a time.

Platform) Each cluster is dedicated to a single MapReduce application so if a user has multiple applications, s/he has to run them in serial on that same resource or buy another cluster for the additional application.

Praveen) Apache Hadoop had a pluggable Scheduler architecture and has Capacity, Fair and FIFO Scheduler. The FIFO Scheduler is the default one. Schedulers allow multiple applications and multiple users to share the cluster at the same time.

Platform) Current Hadoop MapReduce implementations derived from open source are not equipped to address the dynamic resource allocation required by various applications.

Platform) Customers also need support for workloads that may have different characteristics or  are  written in different programming languages. For instance, some of those applications could be data intensive such as MapReduce applications written in Java, some could be CPU intensive such as Monte Carlo simulations which are often written in C++ -- a runtime engine must be designed to support both simultaneously.

Praveen) NextGen MapReduce allows for dynamic allocation of resources. Currently there is only support for RAM based requests, but the framework can be extended for other parameters like CPU, HDD etc in the future.

Platform) As mentioned in part 2 of this blog series, the single job tracker in the current Hadoop implementation is not separated from the resource manager, so as a result, the job tracker does not provide sufficient resource management functionalities to allow dynamic lending and borrowing of available IT resources.

Praveen) NextGen MapReduce separates resource management and task scheduling into separate components.

To summarize, NextGen MapReduce addresses some of the concerns raised by Platform, but it will take some time for NextGen MapReduce to get stabilized and be production ready.

1 comment:

  1. There are some really loved reading your hadoop blog. It was very well authored and easy to understand. Unlike additional blogs I have read which are really not good. I also found your posts very interesting. In fact after reading, I had to go show it to my friend and he enjoyed it as well!
    Hadoop Training in hyderabad