Wednesday, April 16, 2014

Microsoft commoditizing Big Data

Though I have been an Ubuntu buff for quite some time, I have to accept that Microsoft products work out of the box with little tweaking, for good or bad. I had to tweak Ubuntu every once in a while to make the workflows faster for the tasks I had been doing repeatedly. Also, the learning curve is a bit steep for Linux when compared to the Windows platform. Things are slowly changing from the Linux perspective, with Ubuntu leading the effort. But once you get used to Linux, it's very difficult to go back.

With the Microsoft CEO talking about Big Data (1, 2), it's clear that the whole company will rally (embrace and extend) around Big Data wholeheartedly. Microsoft will repeat history with Big Data as it has done with its other products (commoditized, buggy and easy to use).

Installing/configuring/integrating Big Data products is still a pain, despite the various commercial companies working on Big Data. Getting Microsoft into the Big Data space will keep the different Big Data vendors on their toes, and we can expect much more innovation in the usability space.
More lego dudes! by Sappymoosetree from Flickr under CC
Note (18th April, 2014): Here is an article from readwrite echoing the same sentiment.

Saturday, April 12, 2014

Heartbleed explanation as a comic

For the impatient and for those less into technology, here is a nice comic on Heartbleed (1, 2). As seen in the 5th box below, the server returns additional data from memory, which might contain private keys, passwords, logs etc., because the number of letters actually sent ("HAT", 3 letters) doesn't match the number (500) which the client claims to have sent.
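To make the flaw concrete, here is a minimal sketch of the missing bounds check. This is a Java analogue for illustration, not OpenSSL's actual C code: the server echoes back as many bytes as the client claims to have sent, not as many as it actually received, leaking whatever stale data sits in the adjacent memory.

```java
import java.util.Arrays;

public class HeartbeatSketch {

    // Reusable buffer that still holds leftovers from earlier requests
    // (a stand-in for process memory holding keys, passwords, logs etc.).
    private static final byte[] memory = new byte[512];

    static {
        byte[] secret = "password=hunter2".getBytes();
        System.arraycopy(secret, 0, memory, 100, secret.length);
    }

    // Vulnerable echo: returns 'claimedLength' bytes even though the client
    // only placed 'payload.length' bytes into the buffer. No bounds check!
    static byte[] vulnerableEcho(byte[] payload, int claimedLength) {
        System.arraycopy(payload, 0, memory, 0, payload.length);
        return Arrays.copyOfRange(memory, 0, claimedLength);
    }

    // Fixed echo: never returns more than what was actually received.
    static byte[] fixedEcho(byte[] payload, int claimedLength) {
        if (claimedLength > payload.length) {
            return new byte[0]; // silently drop the malformed heartbeat
        }
        System.arraycopy(payload, 0, memory, 0, payload.length);
        return Arrays.copyOfRange(memory, 0, claimedLength);
    }

    public static void main(String[] args) {
        // The client sends "HAT" (3 bytes) but claims 500 bytes.
        byte[] leaked = vulnerableEcho("HAT".getBytes(), 500);
        // Prints "HAT" followed by stale memory, including the planted secret.
        System.out.println(new String(leaked));
    }
}
```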

Friday, April 11, 2014

Updates on Spark

In the earlier blog `Is it Spark 'vs' OR 'and' Hadoop?`, we looked at Spark and Hadoop at a very high level. MapReduce is batch-oriented in nature and so is not well suited for some types of jobs. Spark is well suited for iterative processing as well as for interactive analysis.
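As a rough illustration of why iterative jobs fit Spark well, here is a minimal sketch using Spark's Java API (the input path and the ERROR filter are made up for the example). Caching the dataset keeps it in memory across iterations, where a chain of MapReduce jobs would re-read it from HDFS on every pass.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class IterativeSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("iterative-sketch")
                .setMaster("local[2]"); // local mode, just for the sketch
        JavaSparkContext sc = new JavaSparkContext(conf);

        // cache() keeps the dataset in memory after the first pass,
        // so later iterations skip the expensive read from storage.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt").cache();

        // Each iteration reuses the in-memory dataset.
        for (int i = 0; i < 10; i++) {
            long errors = lines.filter(line -> line.contains("ERROR")).count();
            System.out.println("iteration " + i + ": " + errors);
        }
        sc.stop();
    }
}
```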

For any new framework to be adopted within an enterprise, it's very important that commercial support be available for the framework, unless there is enough expertise within the enterprise to know the internal details of how the framework works and to be able to support it. Cloudera formed a partnership with Databricks (1) in October 2013 and has included Spark (2) in the CDH distribution. Here (3) is Cloudera's vision on Spark. Cloudera put it succinctly:

The leading candidate for “successor to MapReduce” today is Apache Spark. Like MapReduce, it is a general-purpose engine, but it is designed to run many more workloads, and to do so much faster than the older system.

Now it's MapR's turn to form a partnership with Databricks (4) and include Spark in the MapR distributions (5). There is a lot of traction around Spark, and more and more inclusions of Spark in the commercial distributions will make the foundation of Spark more solid. Spark can be installed/configured manually on the HortonWorks distribution (6, 7), but there is no partnership between Databricks and HortonWorks as there is with Cloudera and MapR.
It's not that Hadoop is going away; rather, Hadoop 2.x will be used as the underlying platform for storing huge amounts of data through HDFS and for running different computing models on top of it using YARN (8).

One interesting fact from the Apache blog (9) is that Spark was created in 2009 at the University of California at Berkeley's AMPLab. So, it took almost 5 years to get to the current state.

Spark has some of the best documentation (10) around in open source. Fast Data Processing with Spark is the only book on Spark as of now and covers installing and using Spark. Also, it looks like the Learning Spark: Lightning-fast big data analytics book is still a work-in-progress and will be released at the end of 2014.

Wednesday, April 9, 2014

Big Data and Security

Sometime back I accidentally came across a security blog by Brian Krebs, called Krebs on Security. Recently BusinessWeek published an article on him, and Sony plans to make a movie based on it. He was the one who broke the news of the Target security breach here. Once I started following his blog, I began thinking twice before spending online using my credit card.
Security Circus by Alexandre Dulaunoy from Flickr under CC
So, why all of a sudden a discussion about security now? With Big Data tools like Hadoop getting commoditized, enterprises have an option to store more and more data with less thought about the implications. The more data that is stored, the more we have to think about authentication, authorization and auditing.

For anyone following the Big Data space closely, new features are constantly being added to the existing frameworks and new frameworks are introduced very often, but very few of them revolve around security. Awareness around Big Data security is also low; the Big Data vendors should take up the additional responsibility of educating end users about security by publishing blogs, articles, webinars, code examples and whatever other means possible.

In an upcoming blog, I will write about the current state of the ecosystem around Big Data security. To quote Uncle Ben, `With great power comes great responsibility`, and we need to be more and more responsible with data.

Note (19th April, 2014): Here is a short interview of Brian Krebs on CNN.

Monday, April 7, 2014

Intel and Cloudera Partnership

Intel invests (1, 2) $740 million in Cloudera, taking an 18% stake in it. Intel will be dropping its distribution of Hadoop in favor of CDH, and the CDH bits will be more and more optimized for Intel processors. It's not clear at what level in the technology stack the optimization will be done, as most of the Big Data frameworks have been developed in Java, except for a few like Impala. The Intel Hadoop site still points to the `Intel Distribution for Apache Hadoop Software` and not to the distribution from Cloudera.

This partnership makes sense from the Intel perspective, as more and more data centers are experimenting with AMD processors. From the Cloudera perspective, they get a new distribution channel.

There have been a lot of companies around Big Data because of the hype and the open source nature of the Big Data tools. It's good to see consolidation happening among these companies, with the big companies acquiring or forming partnerships with the smaller ones. Consolidation helps weed out the weaker companies, leaving the stronger ones.
Shaking Hands by zeevveez from Flickr under CC

Thursday, April 3, 2014

Oozie High Availability

In the earlier blog entries, we looked at how to install/configure Oozie, create and submit a simple workflow and finally execute the workflow at regular intervals of time.
Oozie workflows are written in HPDL (Hadoop Process Definition Language), using Hue or something as simple as a text editor. Note that writing HPDL by hand is not easy, so using Hue is the easiest approach, as it automatically generates the HPDL XML. The Oozie client submits the workflow definition to the Oozie server, which in turn starts the different actions defined in the workflow.
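As a minimal sketch of that client/server interaction, here is how a workflow already uploaded to HDFS might be submitted using Oozie's Java client API. The host names, ports and paths below are assumptions for the example (11000 is the usual Oozie server port).

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitWorkflow {
    public static void main(String[] args) throws Exception {
        // Point the client at the Oozie server (URL is an assumption).
        OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

        // The workflow definition (workflow.xml written in HPDL) is expected
        // to already be in HDFS under the application path.
        Properties conf = client.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode:8020/user/me/sample-wf");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "jobtracker-host:8032");

        // run() submits and starts the workflow; from here on the Oozie
        // server drives the actions defined in the workflow.
        String jobId = client.run(conf);
        System.out.println("Workflow job submitted: " + jobId);
    }
}
```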

As seen in the diagram above, the Oozie server is a single point of failure. Oozie now supports HA (active-active Oozie servers), and the feature has been included in CDH 5. Here are the instructions for configuring Oozie in HA mode, and here are more details about the Oozie HA feature from Cloudera.

Monday, March 31, 2014

What is a Big Data cluster?

Very often I get the query `What is a cluster?` when discussing Hadoop and Big Data. To keep it simple, `A cluster is a group or a network of machines wired together, acting as a single entity to work on a task which would take much longer when run on a single machine.` The given task is split and processed by multiple machines in parallel, so the task gets completed faster. Jesse Johnson puts what a cluster is all about and how to design distributed algorithms in simple and clear terms here.
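As a toy, single-machine analogue of that split/process/merge idea (an illustration only, not a distributed program), here is a Java sketch where each parallel worker handles one split of the input and the partial results are combined at the end, which is the same pattern a cluster applies across thousands of nodes.

```java
import java.util.Arrays;
import java.util.List;

public class SplitAndConquer {
    public static void main(String[] args) {
        // The "input data", which gets split across workers.
        List<String> lines = Arrays.asList(
                "to be or not to be",
                "the quick brown fox",
                "big data on commodity machines");

        // Each parallel worker counts words in its own split, and the
        // partial counts are merged into the final result.
        long totalWords = lines.parallelStream()
                .mapToLong(line -> line.split("\\s+").length)
                .sum();

        System.out.println("Total words: " + totalWords);
    }
}
```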
IMG_9370 by NeoSpire from Flickr under CC
In a Big Data cluster, the machines (or nodes) are neither as powerful as a server grade machine nor as dumb as a desktop machine. Having multiple (as in thousands of) server grade machines doesn't make sense from a cost perspective, while a desktop grade machine fails often, which has to be handled appropriately. Big Data clusters have a collection of commodity machines which fall in between server and desktop grade machines.

Similar to open source software projects like Hadoop and others, Facebook started the Open Compute Project around computing infrastructure. Facebook doesn't see any edge over its competitors in having specialized and distinguished hardware, and has been opening up some of its internal infrastructure designs. Anyone can take a design, modify it and come up with their own hardware.

I am not much into hardware, but it makes sense if the different data centers (like those from Amazon, Google, Microsoft and others) have a common specification around hardware, as it brings down the data center building cost due to the scale of manufacturing and shared R&D costs. It's very similar to what has been happening in the Apache and the Linux space, where different companies work together in a collaborative environment on a common goal to make software better and enjoy the benefits of the same.