Monday, March 31, 2014

What is a Big Data cluster?

Very often I get the query `What is a cluster?` when discussing Hadoop and Big Data. To keep it simple, `A cluster is a group or a network of machines wired together, acting as a single entity to work on a task which would take much longer on a single machine.` The given task is split and processed by multiple machines in parallel, so the task gets completed faster. Jesse Johnson puts in simple and clear terms what a cluster is all about and how to design distributed algorithms here.
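The split-and-combine idea can be sketched outside Hadoop with a toy Java example (the chunks and the word counts below are made up): each `machine` is simulated by a parallel worker that counts the words in its own chunk of the input, and the partial results are then combined.

```java
import java.util.Arrays;
import java.util.List;

public class SplitAndCombine {
    public static void main(String[] args) {
        // The big task: count all the words. The input is split into chunks,
        // one per "machine" in our toy cluster.
        List<String> chunks = Arrays.asList(
                "big data on a cluster",
                "a cluster of machines",
                "machines work in parallel");

        // Each chunk is counted independently and in parallel,
        // then the partial counts are combined into the final answer.
        long totalWords = chunks.parallelStream()
                                .mapToLong(chunk -> chunk.split(" ").length)
                                .sum();

        System.out.println(totalWords); // 13
    }
}
```

A real cluster does the same thing, only the chunks live on different machines and the combine step happens over the network.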
IMG_9370 by NeoSpire from Flickr under CC
In a Big Data cluster, the machines (or nodes) are neither as powerful as server-grade machines nor as dumb as desktop machines. Having multiple (as in thousands of) server-grade machines doesn't make sense from a cost perspective, while desktop-grade machines fail often, which has to be handled appropriately. Big Data clusters are built from a collection of commodity machines which fall in between server and desktop grade.

Similar to open source software projects like Hadoop and others, Facebook started the Open Compute Project around computing infrastructure. Facebook doesn't see any edge over its competitors in having hardware that is specialized and distinguished from the rest, and has been opening up some of its internal infrastructure designs. Anyone can take a design, modify it and come up with their own hardware.

I am not much into hardware, but it makes sense for the different data centers (like those from Amazon, Google, Microsoft and others) to have a common specification around hardware, as it brings down the data center building cost due to the scale of manufacturing and shared R&D costs. It's very similar to what has been happening in the Apache and Linux space: different companies work together in a collaborative environment on a common goal to make software better and enjoy the benefits.

Note (21st April, 2014): Here is a nice article from ZDNet on how Facebook saved $$$ using the Open Compute Project.

Lipstick on Pig for monitoring and visualizing jobs

As mentioned in the previous blog, Pig and Hive are higher level abstractions on top of MapReduce. Given a task like joining two data sets, it's much easier to do the join using Pig or Hive, as it takes less coding effort when compared to MapReduce. So, many companies are going with Pig and Hive as they provide better developer productivity.
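As an illustration of the difference in effort: a join in Pig is a single line along the lines of `C = JOIN A BY id, B BY id;`, while even a heavily simplified hand-coded equi-join takes noticeably more code. The Java sketch below joins two made-up in-memory data sets (a real MapReduce join would additionally need mapper, reducer and job setup code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class HandCodedJoin {
    public static void main(String[] args) {
        // Toy data sets: users keyed by id, and (id, amount) orders.
        Map<Integer, String> users = Map.of(1, "alice", 2, "bob");
        List<int[]> orders = List.of(
                new int[]{1, 250}, new int[]{2, 100}, new int[]{1, 75});

        // Hand-coded equi-join: probe the user table for each order row.
        List<String> joined = new ArrayList<>();
        for (int[] order : orders) {
            String name = users.get(order[0]);
            if (name != null) {               // inner join: drop unmatched rows
                joined.add(name + "," + order[1]);
            }
        }
        System.out.println(joined); // [alice,250, bob,100, alice,75]
    }
}
```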

The problem with abstraction is that it gives less control over what can and cannot be done, and debugging at a higher level of abstraction is also difficult as it hides the underlying details. The same is the case with Pig and Hive.

Some time back Netflix open sourced Lipstick, and Google also published a blog entry around the same recently. Pig converts PigLatin scripts into a DAG of MapReduce jobs, and the underlying MR data flows can be difficult to visualize. Lipstick enables developers to visualize and monitor the execution of Pig data flows at a logical level. Earlier, this had to be done using the log files or by looking at the MR web console.
urban decay lipstick naked2 3 from Flickr by ldhren under CC
Netflix and Twitter have been very aggressive in open sourcing their internal projects. With so much choice around, there has never been a better time to take an idea from concept to realization. One of the main criteria for picking a framework or a piece of software is the support provided by commercial vendors. A good percentage of the software around Big Data is free and can be put in production with minimal cost, but lacks the commercial support that helps keep downtime low. Lipstick also falls under this category: it has not been included in any of the commercial Big Data distributions like those from Cloudera, Hortonworks, MapR and others. So, Lipstick has to be installed manually, and patching (for any bugs/improvements) has to be taken care of by the end user.

In an upcoming blog, we will look into how to install and configure Lipstick on top of Pig.

Friday, March 28, 2014

Mahout and MR

There has been an active discussion (1, 2, 3) in the Mahout Dev mailing list about the goals for Mahout 1.0 and also about moving the underlying computation engine from MR to Spark or H2O. But as mentioned in the GigaOM article `Apache Mahout, Hadoop’s original machine learning project, is moving on from MapReduce`, the community hasn't decided yet.

As mentioned in the earlier blogs here, MR is batch oriented in nature and is not suited for iterative processing or for implementing Machine Learning algorithms, as processing with MR involves reads/writes to HDFS after each step in the iteration. Mahout is pretty much tied to MR; though it's not impossible to rewrite the underlying algorithms, it's not an easy task either. It would be the right direction for the Mahout project to move to some non-MR platform, and the sooner the better.
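A toy Java sketch (with made-up numbers) of why iteration is painful on MR: the loop below makes repeated passes over the same data to refine an estimate. On MR, every pass would be a separate job that reads its input from HDFS and writes its output back to HDFS, whereas an in-memory engine keeps the data cached between passes.

```java
public class IterativePasses {
    public static void main(String[] args) {
        double[] data = {1.0, 4.0, 7.0, 10.0};
        double estimate = 0.0;

        // Each pass nudges the estimate toward the mean of the data.
        // On MR, this loop body would be one full job: HDFS read, compute,
        // HDFS write -- repeated 100 times.
        for (int pass = 0; pass < 100; pass++) {
            double gradient = 0.0;
            for (double x : data) {
                gradient += (estimate - x);
            }
            estimate -= 0.1 * (gradient / data.length);
        }

        System.out.printf("%.2f%n", estimate); // 5.50, the mean of the data
    }
}
```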

With the announcement of Oryx from Cloudera, we can expect quick progress around the distributed Machine Learning frameworks.
Directions by MShades From Flickr under CC

Friday, March 7, 2014

Is it Spark 'vs' OR 'and' Hadoop?

There has been a lot of traction around Spark. Cloudera announced (1, 2) it being part of the CDH distribution, and here is their stance on `MR and Spark`. MapReduce is a programming model for distributed computing, while Spark is a framework or a piece of software. The essence of the Cloudera article is accurate, but the blog title is a bit misleading: it should be Hadoop and Spark.

Here is another interesting article from Cloudera on Spark from a Data Science perspective. Mahout implements Machine Learning algorithms in an MR fashion. But MR is batch oriented (high latency and high throughput) in nature and doesn't suit Machine Learning algorithms well, as they are iterative in nature. There is an active discussion in the Mahout community about decoupling the ML algorithms from MR where possible. It is a long shot, but the sooner the effort the better, as alternate distributed ML frameworks are cropping up.

With Spark becoming a top level project and getting a lot of attention, one might wonder what will happen to Hadoop now that Spark is gaining all the attention. So, this blog is all about that.
But, before jumping into Spark, let's look at how Hadoop has evolved. Hadoop 1 has two components, HDFS (for storing data at large scale) and MR (for processing data at large scale), which operate on a group of commodity machines acting like a single cluster (or an entity) to work on a very big task. In Hadoop 2 (aka YARN, Next Generation Architecture, MRv2), Hadoop additionally includes YARN, which has a centralized Resource Manager that allows multiple computing models (powered by YARN) to run on the same cluster. Executing multiple models on the same cluster increases the utilization of the cluster and mostly flattens its usage.

Along the same lines, Spark applications can also run on a resource/cluster management framework like YARN or Apache Mesos. So, it's not a matter of making a choice between Spark and Hadoop: Spark can run on top of YARN (Hadoop 2) and can also use HDFS as a source of data. More on the different Spark execution models in the official documentation here.
With so much traction happening around Spark, it's interesting to see how workloads get moved from some of the other frameworks like Hadoop (MR) to Spark. Note that MR is a distributed computing model and can be implemented on a distributed computing framework like Spark. Here is a simple Scala program for WordCount on top of Spark.
// sc is the SparkContext created by the Spark shell or the driver program
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
With so much happening in the Big Data space, I will try to keep this blog updated with some of the latest happenings.