Monday, February 3, 2014

Introduction to Spark

In an earlier blog post we looked at RDDs, which form the basis for Spark. I had planned to write about Spark in detail, but DBMS2 does a very good job of summarizing `Spark and Databricks` here. Databricks, which is in stealth mode, is expected to mostly provide cloud services and commercial support around Spark. Cloudera is actively pushing Spark (1, 2), and sooner or later we should see Spark in CDH.

Spark and machine learning are a nice combination. MapReduce offers high throughput but also high latency, because each job reads its input from disk and writes its output back to disk. That makes it a poor fit for ML algorithms, which are iterative in nature and pass over the same data many times. The R and Python (1, 2) interfaces to Spark are still a work in progress, so over time it should be possible to use the rich ML and statistical libraries of R and Python with Spark.
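To make the point about iterative processing concrete, here is a rough sketch in plain Python. It does not use the Spark API at all. The `Storage` class, the toy dataset, and the function names are hypothetical, made up for illustration. It only counts how often an iterative job goes back to storage when it re-reads its input on every pass, compared with loading the data once and keeping it in memory, which is the idea behind caching an RDD.

```python
# Plain-Python sketch (no Spark API): count how often an iterative
# job has to go back to storage for its input.


class Storage:
    """Hypothetical stand-in for slow, disk-backed storage such as HDFS."""

    def __init__(self, points):
        self._points = points
        self.reads = 0  # number of full scans of the dataset

    def load(self):
        self.reads += 1
        return list(self._points)


def gradient_step(points, w, lr=0.1):
    """One gradient-descent step for fitting y = w * x by least squares."""
    grad = sum(2 * (w * x - y) * x for x, y in points) / len(points)
    return w - lr * grad


def fit_rereading(storage, iterations):
    """MapReduce-like pattern: every iteration reads its input again."""
    w = 0.0
    for _ in range(iterations):
        points = storage.load()
        w = gradient_step(points, w)
    return w


def fit_cached(storage, iterations):
    """Cache-like pattern: read once, then iterate over in-memory data."""
    points = storage.load()
    w = 0.0
    for _ in range(iterations):
        w = gradient_step(points, w)
    return w


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data on the line y = 2x
disk_a, disk_b = Storage(data), Storage(data)
w_a = fit_rereading(disk_a, 20)
w_b = fit_cached(disk_b, 20)
print(w_a, disk_a.reads)  # same model, one storage scan per iteration
print(w_b, disk_b.reads)  # same model, a single storage scan
```

Both versions arrive at the same model. The difference is only in how many times the input is scanned from storage, and that per-iteration cost is what hurts iterative ML jobs on MapReduce.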

Btw, here is the original paper on Spark.

1 comment:

  1. Way sooner than you expect :) http://blog.cloudera.com/blog/2014/02/spark-is-now-generally-available-for-cloudera-enterprise/
