Friday, April 11, 2014

Updates on Spark

In the earlier post `Is it Spark 'vs' OR 'and' Hadoop?`, we looked at Spark and Hadoop at a very high level. MapReduce is batch-oriented by nature, so it is not well suited to some types of jobs. Spark, by contrast, is well suited to iterative processing as well as to interactive analysis.

For any new framework to be adopted within an enterprise, it's very important that commercial support be available for it, unless the enterprise has enough in-house expertise to understand the framework's internals and support it on its own. Cloudera formed a partnership with Databricks (1) in October 2013 and has included Spark (2) in the CDH distribution. Here (3) is Cloudera's vision on Spark. Cloudera put it succinctly:

The leading candidate for “successor to MapReduce” today is Apache Spark. Like MapReduce, it is a general-purpose engine, but it is designed to run many more workloads, and to do so much faster than the older system.

Now it's MapR's turn to form a partnership with Databricks (4) and include Spark in the MapR distributions (5). There is already a lot of traction around Spark, and each additional commercial distribution that includes it will make its foundation more solid. Spark can be installed and configured manually on the Hortonworks distribution (6, 7), but there is no partnership between Databricks and Hortonworks like the ones with Cloudera and MapR.
It's not that Hadoop is going away. Rather, Hadoop 2.x will be used as the underlying platform for storing huge amounts of data in HDFS and for running different computing models on YARN (8).

One interesting fact from the Apache blog (9) is that Spark was created in 2009 at the University of California, Berkeley's AMPLab. So, it took almost 5 years to get to the current state.

Spark has some of the best documentation (10) in open source. Fast Data Processing with Spark is the only book on Spark as of now, and it covers installing and using Spark. It also looks like the book Learning Spark: Lightning-fast big data analytics is still a work in progress and will be released at the end of 2014.

Note (2nd May, 2014): Hortonworks is now also including Spark in HDP. More here.
