Tuesday, September 17, 2013

Storm and Samza for real time processing of data

It has been some time since I actively wrote on this site, due to personal reasons. Now with things settled, I will be writing more and more here as I explore the journey of Big Data.
There are often requirements for both real time and batch processing of data. In the case of fraud detection, the longer the delay in identifying the fraud, the higher the probability of damage. So batch processing models like MapReduce (and Hive and Pig on top of it) won't fit the picture. This is where frameworks for real time processing of data come in.
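
To make the latency point concrete, here is a minimal sketch in plain Python (not Storm or Samza code). The fraud rule, the window length, the threshold and the sample events are all made up for illustration. It applies the same rule once per event, as a stream processor would, and once at the end of a batch.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60      # hypothetical sliding-window length
MAX_TXNS_IN_WINDOW = 3   # hypothetical threshold for "suspicious"

def stream_detect(events):
    """Yield (timestamp, card) as soon as a card exceeds the threshold."""
    recent = defaultdict(deque)  # card -> timestamps inside the window
    for ts, card in events:
        window = recent[card]
        window.append(ts)
        # Drop timestamps that have fallen out of the sliding window.
        while window and ts - window[0] > WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_TXNS_IN_WINDOW:
            yield ts, card

def batch_detect(events, batch_end):
    """Same rule, but alerts only become visible when the batch finishes."""
    return [(batch_end, card) for _, card in stream_detect(events)]

# Hypothetical (timestamp_in_seconds, card_id) events.
events = [(0, "A"), (10, "A"), (20, "B"), (30, "A"), (40, "A"), (50, "A")]

stream_alerts = list(stream_detect(events))
batch_alerts = batch_detect(events, batch_end=3600)  # e.g. an hourly job
```

With these made-up numbers, the per-event version raises its first alert on the event that crosses the threshold, while the batch version reports the same card only when the hourly job ends.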

LinkedIn has been very active in open sourcing some of its internal frameworks. Along the same lines, Samza has been open sourced for real time processing of data. As with any other project, it will initially be in the incubator stage, and as the project makes progress it will be promoted to an Apache Top Level Project (TLP). Here is an article from GigaOm on the same.

Here is a comparison between Storm (released by Twitter) and Samza, both of which are used for real time processing of data. BTW, here (1, 2, 3) are some nice references to Twitter Storm.

Netflix has also been very active in open sourcing some of its internal projects. Some time back they announced Lipstick, which monitors a Pig job graphically as it makes progress. I started on the installation and configuration of Lipstick, but dropped it midway because of the complexity involved.

I can't complain about free software; it's great that such software is being given away for free. But we have to adopt it with a bit of wisdom. Not all of these frameworks are commercially supported, so users have to concentrate not only on how to use a framework, but also on its internals. Knowing the internals helps to solve any problems that pop up in the underlying framework.

One of the questions I often get is when to go with the Apache version or a commercial version (from Cloudera, Hortonworks, MapR etc.) of one of the Big Data frameworks. Companies like Facebook and Yahoo, which have enough resources to hire developers who can go through the internals of a Big Data framework to make changes if required, or go through the pain of integrating different frameworks, will go with the Apache version. A company with limited resources that would like to focus its energy mostly on the business side will mostly go with a commercial version.

There are pros and cons to both the Apache way and the commercial way, which have to be evaluated based on the requirements and the amount of resources available for the Big Data initiative.