Friday, April 25, 2014

Nice video on Machine Learning with Python

When I started looking at Big Data/Hadoop a couple of years back, the information available to get started was sparse and I had to spend a lot of time getting comfortable with Big Data. Now the problem is that there is too much information/noise, and it takes time to separate the good from the bad. RSS aggregators (1) make it even more difficult.
Anyway, above is a nice video to get started with Machine Learning and Python. There is a lot happening in the R and Python space in the context of Machine Learning, but this session is a bit biased towards Python and covers everything from what Machine Learning is all about to Python code for a simple use case. It's worth the time watching the video.
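To give a flavour of the kind of simple use case such sessions cover, here is a minimal sketch using scikit-learn (an assumption on my part; the session may well use different libraries) that trains a classifier on the classic Iris dataset:

```python
# A toy classification workflow with scikit-learn; train_test_split lives in
# sklearn.cross_validation at the time of writing (newer versions moved it
# to sklearn.model_selection).
from sklearn.cross_validation import train_test_split
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Load a small, well-known labelled dataset.
iris = load_iris()

# Hold out 30% of the samples to evaluate the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# Train a k-nearest-neighbours classifier and report its accuracy.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```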

Wednesday, April 23, 2014

Screencast for submitting a job to a cluster

In an earlier blog (1) we looked at how to develop a simple MapReduce program in Eclipse on a Linux machine. Here (1) is another screencast, on submitting a word count MapReduce program written in Python. For some reason Windows Media Player is not able to play the file, but VLC is.

Here is the code for the mapper (1) and the reducer (1). Note that Hadoop provides the Streaming (1) feature for writing MapReduce programs in non-Java languages.
https://dl.dropboxusercontent.com/u/3182023/Screencasts/Execute-WordCount-PythonMR-In-VirtualMachine.mp4
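For those who just want to skim the idea, here is a minimal sketch of what such a Streaming mapper and reducer look like (the actual code linked above may differ):

```python
#!/usr/bin/env python
# mapper.py -- reads lines from stdin and emits one tab-separated
# (word, 1) pair per word to stdout.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop sorts the mapper output by key, so all counts for
# a given word arrive together and can be summed with a running total.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))
```

A streaming job is then submitted along the lines of `hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the jar name and paths here are placeholders and depend on the installation).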
As shown in the screencast, the Virtual Machine (VM) has all the necessary pieces to get started easily with Big Data and is provided as part of the Big Data training (1). The VM is updated regularly to add new frameworks and update the existing ones.

Hadoop tries to hide the underlying infrastructure details, so the MapReduce code and the command to submit a job are the same for a single-node cluster and a thousand-node cluster.

Tuesday, April 22, 2014

Screencast for developing MapReduce programs in Eclipse

In an earlier blog (1), we looked at how to develop MapReduce programs in Eclipse. Here (Develop-MR-Program-In-Eclipse.mp4) is the screencast for the same. The screencast was recorded using Kazam on Ubuntu 14.04 in the mp4 format. For some reason Windows Media Player is not able to play the file, but VLC is.
https://dl.dropboxusercontent.com/u/3182023/Screencasts/Develop-MR-Program-In-Eclipse.mp4
The process in the screencast works only on Linux and assumes that Java and Eclipse have already been installed. Here (1, 2) are the instructions if Java is not installed. Also, Hadoop (hadoop-1.2.1.tar.gz) has to be downloaded from here (1) and extracted to a folder. The code for the WordCount MapReduce Java program can either be written from scratch or downloaded here (1).

Wednesday, April 16, 2014

Microsoft commoditizing Big Data

Though I have been an Ubuntu buff for quite some time, I have to accept that, for good or bad, Microsoft products work out of the box with little tweaking. I had to tweak Ubuntu every once in a while to make the workflows faster for the tasks I had been doing repeatedly. Also, the learning curve for Linux is a bit steep compared to the Windows platform. Things are slowly changing on the Linux side, with Ubuntu leading the effort. But once you get used to Linux, it's very difficult to go back.

With the Microsoft CEO talking about Big Data (1, 2), it's clear that the whole company will rally (embrace and extend) around Big Data wholeheartedly. Microsoft will repeat history with Big Data as it has with its other products (commoditized, buggy and easy to use).

Installing/configuring/integrating Big Data products is still a pain, despite the various commercial companies working in the Big Data space. Getting Microsoft into Big Data will keep the different vendors on their toes, and we can expect much more innovation around usability.
More lego dudes! by Sappymoosetree from Flickr under CC
Note (18th April, 2014): Here is an article from readwrite echoing the same sentiment.

Saturday, April 12, 2014

Heartbleed explanation as a comic

For the impatient and for those less into technology, here is a nice comic on Heartbleed (1, 2). As seen in the 5th box below, the server returns additional data from its memory, which might contain private keys, passwords, logs etc., because the number of letters actually sent ("HAT", 3 letters) doesn't match the number (500) the client claims to have sent.
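For the more code-minded, the bug can be caricatured in a few lines of Python (a toy sketch, not the actual OpenSSL code):

```python
# The server's memory: the echoed payload sits right next to secrets.
server_memory = "HAT" + "key=deadbeef;user=alice;password=hunter2;..."

def heartbeat(claimed_length):
    # Buggy: the server trusts the client-supplied length instead of
    # checking how many bytes were actually sent, so it can read past
    # the payload into adjacent memory.
    return server_memory[:claimed_length]

print(heartbeat(3))    # "HAT" -- an honest request
print(heartbeat(500))  # "HAT" plus whatever secrets follow -- the leak
```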

Friday, April 11, 2014

Updates on Spark

In the earlier blog `Is it Spark 'vs' OR 'and' Hadoop?`, we looked at Spark and Hadoop at a very high level. MapReduce is batch oriented by nature, so it is not well suited for some types of jobs. Spark, on the other hand, is well suited for iterative processing as well as for interactive analysis.
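To make that concrete, here is a minimal PySpark sketch (the local master and input path are assumptions) showing the in-memory caching that makes iterative and interactive workloads fast:

```python
from pyspark import SparkContext

sc = SparkContext("local", "WordCount")

# Load a text file and cache it in memory; subsequent actions reuse the
# cached data instead of re-reading from disk, which is what makes
# repeated queries over the same dataset cheap in Spark.
lines = sc.textFile("input.txt").cache()

counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.collect():
    print("%s\t%d" % (word, count))
```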

For any new framework to be adopted within an enterprise, it's very important that commercial support be available, unless there is enough in-house expertise to know the internal details of how the framework works and to support it. Cloudera formed a partnership with Databricks (1) in October 2013 and has included Spark (2) in the CDH distribution. Here (3) is Cloudera's vision for Spark. Cloudera put it succinctly:

The leading candidate for “successor to MapReduce” today is Apache Spark. Like MapReduce, it is a general-purpose engine, but it is designed to run many more workloads, and to do so much faster than the older system.

Now it's MapR's turn to form a partnership with Databricks (4) and include Spark in the MapR distributions (5). There is a lot of traction around Spark, and more inclusions of Spark in the commercial distributions will make its foundation more solid. Spark can be installed/configured manually on the HortonWorks distribution (6, 7), but there is no partnership between Databricks and HortonWorks as there is with Cloudera and MapR.
It's not that Hadoop is going away; rather, Hadoop 2.x will be used as the underlying platform for storing huge amounts of data in HDFS and for running different computing models on YARN (8).

One interesting fact from the Apache blog (9) is that Spark was created in 2009 at the University of California, Berkeley's AMPLab. So, it took almost 5 years to get to its current state.

Spark has some of the best documentation (10) in open source. Fast Data Processing with Spark is the only book on Spark as of now and covers installing and using Spark. Also, it looks like the book Learning Spark: Lightning-Fast Big Data Analytics is still a work in progress and will be released at the end of 2014.

Wednesday, April 9, 2014

Big Data and Security

Sometime back I accidentally came across a security blog by Brian Krebs, called Krebs on Security. Recently BusinessWeek published an article on him, and Sony plans to make a movie based on it. He was the one who broke the Target security breach story here. Once I started following his blog, I began thinking twice before spending online with my credit card.
Security Circus by Alexandre Dulaunoy from Flickr under CC
So, why a discussion about security all of a sudden? With Big Data tools like Hadoop getting commoditized, enterprises can store more and more data with little thought about the implications. The more data stored, the more we have to think about authentication, authorization and auditing.

For anyone following the Big Data space closely, new features are added to the existing frameworks and new frameworks are introduced very often, but very few of them revolve around security. Awareness of Big Data security is also low; the Big Data vendors should take up the additional responsibility of educating end users about security by publishing blogs, articles, webinars, code examples and whatever other means possible.

In an upcoming post, I will write about the current state of the ecosystem around Big Data security. To quote Uncle Ben, `With great power comes great responsibility`, and we need to be ever more responsible with data.

Note (19th April, 2014): Here is a short interview with Brian Krebs on CNN.