Big Data and Cloud Tips: April 2014

Tuesday, April 29, 2014

Wanted interns to work on Big Data Technologies

We are looking for interns to work with us on some of the Big Data technologies in Hyderabad, India. The pay would be appropriate. The intern should preferably be from a Computer Science background, be really passionate about learning new technologies and be ready to stretch a bit. Under our guidance, the intern would install, configure and tune everything from the Linux OS all the way up to Hadoop and the related Big Data frameworks on the cluster. Once the Hadoop cluster has been set up, we have a couple of ideas which we would be implementing on the same cluster.

The immediate advantage is that the intern would be working on one of the current hot technologies and would have direct access to us to learn more about Big Data. Also, based on the requirement, appropriate training would be given around Big Data. The work done by the intern will also definitely help in getting through the different Cloudera Certifications.

BTW, we are looking for someone who can work with us full time and not part time. If you or anyone you know is interested in taking an internship, please send an email with your CV to praveensripati@gmail.com.

Monday, April 28, 2014

Big Data Webinars

Big Data is a moving target, with new companies, frameworks and features getting introduced all the time. It's getting more and more difficult to keep pace with Big Data. There is nothing more relaxing than sitting in a chair and watching webinars on some of the latest technologies.

Brighton Beach by appoose81 from Flickr under CC

So, here (1) is a calendar (XML, ICAL, HTML) with a few of the upcoming webinars around Big Data. I will be adding more events to the calendar as they get planned. Those interested can import the calendar into Thunderbird, Outlook or some other calendar application and stay updated with the webinars in the Big Data space.

If you are interested in including any webinar around Big Data in the calendar, then let me know at praveensripati@gmail.com.

http://www.thecloudavenue.com/p/big-data-webinars.html

Saturday, April 26, 2014

Automating things using IFTTT

I am a big fan of automation and recently was looking for a way to automatically tweet when I publish a new blog. I found ifttt.com, which allows creating recipes like these and sharing them with others. The recipes run every 15 minutes, so there is a delay of at most 15 minutes between publishing a new blog and getting it posted on Twitter.

The acronym IFTTT is a bit cryptic to remember and expands to `IF This Then That`. The service has been running for almost 4 years, but was a bit flaky when I used it.

I was not able to create recipes on the first attempt and was also not able to add LinkedIn as a channel. Also, it's not possible to create multiple channels of the same type. For example, the new blog event cannot be sent to multiple Twitter accounts. I had to create multiple accounts with IFTTT so as to add multiple Twitter channels. Creating multiple triggers for a single recipe is not possible for now either. It would also be nice to have some more complex recipes like if-then-else and others.

The recipes have already been created for any new blog posted here to be tweeted. I need to wait and see whether this post gets tweeted or not.

Friday, April 25, 2014

Nice video on Machine Learning with Python

When I started looking around Big Data/Hadoop a couple of years back, the information available to get started was sparse and I had to spend a lot of time to get comfortable with Big Data. Now the problem is that there is too much information/noise and it takes time to separate the good from the bad. RSS aggregators (1) are making it even more difficult.

Anyway, above is a nice video to get started with Machine Learning and Python. There are a lot of things happening in the R and Python space in the context of Machine Learning, but this session is a bit biased towards Python and covers everything from what Machine Learning is all about to showing Python code for a simple use case. The video is worth the time spent watching it.

Wednesday, April 23, 2014

Screencast for submitting a job to a cluster

In an earlier blog (1) we looked at how to develop a simple MapReduce program in Eclipse on a Linux machine. Here (1) is another screencast on submitting a word count MapReduce program written in Python. For some reason Windows Media Player is not able to play the file, but VLC is able to.

Here is the code for the mapper (1) and the reducer (1). Note that Hadoop provides the Streaming (1) feature for writing MapReduce programs in non-Java languages.
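The linked mapper and reducer are not reproduced here. As a rough sketch of what a Streaming word count can look like (not the code from the screencast), the snippet below assumes the usual Streaming convention of tab-separated key/value lines, with the reducer receiving its input sorted by key. The function names `map_lines`, `reduce_pairs` and `simulate_job` are made up for illustration.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit a tab-separated (word, 1) record for every word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_pairs(pairs):
    """Reducer: sum the counts of consecutive records that share a word.

    The input is assumed to be sorted by key, so all records for one
    word arrive next to each other.
    """
    parsed = (pair.rstrip("\n").split("\t", 1) for pair in pairs)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def simulate_job(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reduce_pairs(sorted(map_lines(lines))))


if __name__ == "__main__":
    for record in simulate_job(["the cat", "the hat"]):
        print(record)
```

In a real Streaming job the mapper and the reducer would be separate scripts reading from standard input and writing to standard output; `simulate_job` only imitates the sort that happens between the two phases so the logic can be tried on a laptop.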

https://dl.dropboxusercontent.com/u/3182023/Screencasts/Execute-WordCount-PythonMR-In-VirtualMachine.mp4

As observed in the screencast, the Virtual Machine (VM) has all the necessary pieces to get started easily with Big Data and is provided as part of the Big Data training (1). The VM is updated regularly to add new frameworks and update the existing ones.

Hadoop tries to hide the underlying infrastructure details, so the MapReduce code and the command to submit a job are the same for a single-node cluster and a thousand-node cluster.

Tuesday, April 22, 2014

Screencast for developing MapReduce programs in Eclipse

In one of the earlier blogs (1), we looked at how to develop MapReduce programs in Eclipse. Here (Develop-MR-Program-In-Eclipse.mp4) is the screencast for the same. The screencast was recorded in the mp4 format using Kazam on Ubuntu 14.04. For some reason Windows Media Player is not able to play the file, but VLC is able to.

https://dl.dropboxusercontent.com/u/3182023/Screencasts/Develop-MR-Program-In-Eclipse.mp4

The process in the screencast only works on Linux and assumes that Java and Eclipse have already been installed. Here (1, 2) are the instructions if Java is not installed. Also, Hadoop (hadoop-1.2.1.tar.gz) has to be downloaded from here (1) and extracted to a folder. The code for the WordCount MapReduce Java program can either be written by hand or downloaded here (1).

Wednesday, April 16, 2014

Microsoft commoditizing Big Data

Though I have been an Ubuntu buff for quite some time, I have to accept that Microsoft products work out of the box with little tweaking, for good or bad. I have had to tweak Ubuntu every once in a while to speed up the workflows for the tasks I do repeatedly. Also, the learning curve is a bit steep for Linux when compared to the Windows platform. Things are slowly changing from a Linux perspective, with Ubuntu leading the effort. But, once you get used to Linux, it's very difficult to go back.

With the Microsoft CEO talking about Big Data (1, 2), it's definite that the whole company will rally (embrace and extend) around Big Data wholeheartedly. Microsoft will repeat history with Big Data, as it has done with its other products (commoditized, buggy and easy to use).

Installing/configuring/integrating Big Data products is still a pain, despite the various commercial companies working around Big Data. Getting Microsoft into the Big Data space will keep the different Big Data vendors on their toes and we can expect much more innovation in the usability space.

More lego dudes! by Sappymoosetree from Flickr under CC

Note (18th April, 2014): Here is an article from ReadWrite echoing the same.

Saturday, April 12, 2014

Heartbleed explanation as a comic

For the impatient and for those less into technology, here is a nice comic on Heartbleed (1, 2). As seen in the 5th box below, the server returns some additional data from its memory, which might contain private keys, passwords, logs etc., because the number of letters actually sent (3, in HAT) doesn't match the number (500) which the client claims to send.
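For those who prefer code to comics, here is a toy model of the same idea. It is only a sketch and not the actual OpenSSL code: the memory contents, the function names and the secrets are all made up for illustration. The vulnerable function trusts the length claimed by the client, while the checked one drops a request whose claimed length exceeds what was really sent.

```python
def heartbeat_response(memory, payload_offset, claimed_length):
    """Vulnerable echo: trusts the length the client claims."""
    return bytes(memory[payload_offset:payload_offset + claimed_length])


def checked_heartbeat_response(memory, payload_offset, actual_length, claimed_length):
    """Checked echo: ignore the request if the claimed length is too large."""
    if claimed_length > actual_length:
        return b""
    return bytes(memory[payload_offset:payload_offset + claimed_length])


# Toy server memory: the 3-byte payload sits right next to unrelated
# (made-up) secrets that belong to other users.
payload = b"HAT"
memory = bytearray(payload + b"|user=alice|password=not-a-real-password|")

# An honest client claims 3 letters and gets back exactly what it sent.
honest = heartbeat_response(memory, 0, len(payload))

# A malicious client claims 500 letters and gets the neighbouring memory too.
leaky = heartbeat_response(memory, 0, 500)

# With the length check in place the same request yields nothing.
blocked = checked_heartbeat_response(memory, 0, len(payload), 500)
```

Python slices simply stop at the end of the buffer, so the toy leak is limited to the made-up `memory`. In a C program the read would carry on into whatever happens to be stored next to the payload.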

Friday, April 11, 2014

Updates on Spark

In the earlier blog `Is it Spark 'vs' OR 'and' Hadoop?`, we looked at Spark and Hadoop at a very high level. MapReduce is batch oriented in nature, so it is not well suited for some types of jobs. Spark is well suited for iterative processing as well as for interactive analysis.

For any new framework to be adopted within an enterprise, it's very important that commercial support be available for the framework, unless there is enough expertise within the enterprise to understand the internals of the framework and to support it. Cloudera formed a partnership with Databricks (1) in October 2013 and has included Spark (2) in the CDH distribution. Here (3) is Cloudera's vision on Spark. Cloudera put it succinctly:

The leading candidate for “successor to MapReduce” today is Apache Spark. Like MapReduce, it is a general-purpose engine, but it is designed to run many more workloads, and to do so much faster than the older system.

Now it's MapR's turn to form a partnership with Databricks (4) and include Spark in the MapR distributions (5). There is already a lot of traction around Spark, and more and more inclusions of Spark in the commercial distributions will make its foundation more solid. Spark can be installed/configured manually on the Hortonworks distribution (6, 7), but there is no partnership between Databricks and Hortonworks as there is with Cloudera and MapR.

It's not that Hadoop is going away. Hadoop 2.x will be used as the underlying platform for storing huge amounts of data through HDFS and also for running different computing models using YARN (8).

One interesting fact from the Apache blog (9) is that Spark was created in 2009 at the AMPLab of the University of California, Berkeley. So, it took almost 5 years to get to the current state.

Spark has some of the best documentation (10) around open source. `Fast Data Processing with Spark` is the only book around Spark as of now and covers installing and using Spark. Also, it looks like the book `Learning Spark: Lightning-fast big data analytics` is still a work in progress and will be released by the end of 2014.

Note (2nd May, 2014): Now Hortonworks is also including Spark in HDP. More here.

Wednesday, April 9, 2014

Big Data and Security

Some time back I accidentally came across a security blog by Brian Krebs, called Krebs on Security. Recently BusinessWeek published an article on him and Sony plans to make a movie based on it. He was the one who broke the news of the Target security breach here. Once I started following his blog, I began thinking twice before spending online using my credit card.

Security Circus by Alexandre Dulaunoy from Flickr under CC

So, why a discussion about security all of a sudden? With Big Data tools like Hadoop getting commoditized, enterprises have the option to store more and more data with less thought about the implications. The more data is stored, the more we have to think about authentication, authorization and auditing.

Anyone who follows the Big Data space closely will notice that new features are added to the existing frameworks and new frameworks are introduced very often, but very few of them revolve around security. Awareness around Big Data security is also low. The Big Data vendors should take up the additional responsibility of educating the end users around security by publishing more blogs, articles, webinars, code examples and by whatever other means possible.

In an upcoming blog, I will write about the current state of the ecosystem around Big Data security. To quote Uncle Ben, `With great power comes great responsibility`, and we need to be more and more responsible around data.

Note (19th April, 2014): Here is a short interview of Brian Krebs on CNN.

Monday, April 7, 2014

Intel and Cloudera Partnership

Intel invests (1, 2) $740 million in Cloudera, taking an 18% stake in it. Intel would be dropping its distribution of Hadoop in favor of CDH, and the CDH bits will be more and more optimized for Intel processors. I am not sure at what level in the technology stack the optimization would be done, as most of the Big Data frameworks have been developed in Java, except for a few like Impala. The Intel Hadoop site still points to the `Intel Distribution for Apache Hadoop Software` and not to the distribution from Cloudera.

This partnership makes sense from Intel's perspective, as more and more data centers are experimenting with AMD processors. From Cloudera's perspective, they would be getting a new distribution channel.

There have been a lot of companies around Big Data because of the hype and the open source nature of the Big Data tools. It's good to see consolidation happening around these companies, with the big companies acquiring or forming partnerships with the small companies. Consolidation helps to weed out the weaker companies, leaving the stronger ones.

Shaking Hands by zeevveez from Flickr under CC

Thursday, April 3, 2014

Oozie High Availability

In the earlier blog entries, we looked at how to install/configure Oozie, create and submit a simple workflow and finally execute the workflow at regular intervals of time.

Oozie workflows are written in hPDL (Hadoop Process Definition Language), either using Hue or with something as simple as a text editor. Note that writing hPDL by hand is not easy, so using Hue would be the easiest approach as it automatically generates the hPDL XML code. The Oozie Client submits the workflow definition to the Oozie Server, which in turn starts the different actions as defined in the workflow definition.
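To give a feel for what such generated XML looks like, here is a small sketch that builds a one-action workflow definition in Python. It is not what Hue produces internally. The workflow name, the action name and the HDFS path are hypothetical, and the `uri:oozie:workflow:0.4` schema version is an assumption which should be matched to the installed Oozie version.

```python
import xml.etree.ElementTree as ET

# Assumed schema version; adjust to the Oozie release in use.
WORKFLOW_NS = "uri:oozie:workflow:0.4"


def build_workflow(name, action_name, mkdir_path):
    """Build a workflow with a single fs action that creates a directory."""
    app = ET.Element("workflow-app", {"xmlns": WORKFLOW_NS, "name": name})

    # Every workflow starts at <start> and names the first node to run.
    ET.SubElement(app, "start", {"to": action_name})

    # The action says what to do and where to go on success or failure.
    action = ET.SubElement(app, "action", {"name": action_name})
    fs = ET.SubElement(action, "fs")
    ET.SubElement(fs, "mkdir", {"path": mkdir_path})
    ET.SubElement(action, "ok", {"to": "end"})
    ET.SubElement(action, "error", {"to": "fail"})

    # <kill> ends the workflow with an error, <end> ends it successfully.
    kill = ET.SubElement(app, "kill", {"name": "fail"})
    ET.SubElement(kill, "message").text = "Workflow failed"
    ET.SubElement(app, "end", {"name": "end"})

    return ET.tostring(app, encoding="unicode")


# Hypothetical names and path, for illustration only.
workflow_xml = build_workflow(
    "demo-wf", "make-dir", "hdfs://namenode.example.com:8020/tmp/demo"
)

if __name__ == "__main__":
    print(workflow_xml)
```

Even this tiny workflow needs a start node, an action with its two transitions, a kill node and an end node, which shows why a tool that generates the XML is convenient for anything bigger.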

As seen in the above diagram, the Oozie Server is a single point of failure. Oozie now supports HA (active-active Oozie Server) and the feature has been included in CDH 5. Here are the instructions for configuring Oozie in HA mode and here are more details about the Oozie HA feature from Cloudera.
