Thursday, May 22, 2014

Pig as a Service: Hadoop challenges data warehouses

Thanks to Gil Allouche (Qubole's VP of Marketing) for this post.

Hadoop and its ecosystem have evolved from a narrow map-reduce architecture to a universal data platform set to dominate the data processing landscape in the future. Importantly, the push to simplify Hadoop deployments with managed cloud services known as Hadoop-as-a-Service is increasing Hadoop’s appeal to new data projects and architectures. Naturally, the development is permeating the Hadoop ecosystem in the shape of Pig as a Service offerings, for example.

Pig, developed by Yahoo research in 2006, enables programmers to write data transformation programs for Hadoop quickly and easily without the cost and complexity of map-reduce programs. Consequently, ETL (Extract, Transform, Load), the core workload of DWH (data warehouse) solutions, is often realized with Pig in the Hadoop environment. The business case for Hadoop and Pig as a Service is very compelling from financial and technical perspectives.

Hadoop is becoming data’s Swiss Army knife
The news on Hadoop last year was dominated by SQL (Structured Query Language) on Hadoop, with Hive, Presto, Impala, Drill, and countless other flavours competing on making big data accessible to business users. Most of these solutions are supported directly by Hadoop distributors, e.g. Hortonworks, MapR, Cloudera, and cloud service providers, e.g. Amazon and Qubole.

The push for development in the area is driven by the vision for Hadoop to become the data platform of the future. The release of Hadoop 2.0 with YARN (Yet Another Resource Negotiator) last year was an important step. It turned the core of Hadoop’s processing architecture from a map-reduce centric solution into a generic cluster resource management tool able to run any kind of algorithm and application. Hadoop solution providers are now racing to capture the market for multipurpose, any-size data processing. SQL on Hadoop is only one of the stepping-stones to this goal.

Friday, May 16, 2014

User recommendations using Hadoop, Flume, HBase and Log4J - Part 2

Thanks to Srinivas Kummarapu for this post on how to show the appropriate recommendations to a web user based on the user activity in the past.

In the previous blog we saw how to Flume the user activities into the Hadoop cluster. On top of these user activities, some analysis can be done to figure out what a particular user is interested in.

For example, if a user wanted to buy a mobile from a shopping site and ended up buying none, all of his activities are still captured in the Hadoop cluster, where analysis can be done to figure out what type of phones that particular user is interested in. Those phones can be recommended when the user visits the site again.

The user activities in HBase consist of only the mobile name and no more details. More details about the mobile phone can be maintained in an RDBMS. We need to join the RDBMS data (mobile details) with the HBase data to send the information to the Recommendations tables of the RDBMS in order to make recommendations to the user.

Here we have two options to perform Joins.

1) Send the result of the Hadoop cluster to RDBMS and do Joins there.
2) Get the RDBMS data into HBase to perform join in parallel distributed fashion.

Both can be done with Sqoop (SQL to Hadoop), a tool that runs map-only jobs.
In this article we will see how to Sqoop the RDBMS table into the HBase database in an incremental fashion.
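
The idea behind an incremental import can be sketched in plain Python. This is not Sqoop itself, only a minimal illustration of the bookkeeping: remember the highest value of a check column seen so far and copy only the rows above it on the next run. The `mobiles` table, its columns, the `details:` column names and the dict standing in for the HBase table are all hypothetical.

```python
import sqlite3

def incremental_import(conn, target, last_value):
    """Copy rows with id > last_value into `target`; return the new high-water mark."""
    rows = conn.execute(
        "SELECT id, name, price FROM mobiles WHERE id > ? ORDER BY id",
        (last_value,),
    ).fetchall()
    for row_id, name, price in rows:
        # Row key -> column map, loosely mimicking an HBase row.
        target[str(row_id)] = {"details:name": name, "details:price": price}
        last_value = row_id
    return last_value

# Hypothetical RDBMS table, held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mobiles (id INTEGER PRIMARY KEY, name TEXT, price REAL)")
conn.executemany("INSERT INTO mobiles VALUES (?, ?, ?)",
                 [(1, "PhoneA", 199.0), (2, "PhoneB", 299.0)])

hbase_table = {}  # stand-in for the HBase table
checkpoint = incremental_import(conn, hbase_table, 0)  # first run copies both rows

conn.execute("INSERT INTO mobiles VALUES (3, 'PhoneC', 399.0)")
checkpoint = incremental_import(conn, hbase_table, checkpoint)  # copies only row 3
```

Keeping the checkpoint between runs is what makes the import incremental. Rows that were already copied are never read again.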

Friday, May 9, 2014

User recommendations using Hadoop, Flume, HBase and Log4J - Part 1

Thanks to Srinivas Kummarapu for this post on how to show the appropriate recommendations to a web user based on the user activity in the past.

This first part of a four-part article assumes that Hadoop, Flume, HBase and Log4J have already been installed. In this article we will see how to track the user activities and dump them into HDFS and HBase. In future articles, we will look into some kind of basket analysis on the data in HDFS/HBase and will project the results to the transaction database for recommendations. Also, refer to this article on how to Flume the data into HDFS.

Friday, May 2, 2014

Looking for guest bloggers at

The first entry was posted on this blog on 28th September, 2011. Initially I started blogging as an experiment, but lately I have been having fun and enjoying blogging.

Not only has the traffic to the blog been increasing at a very good pace, but I have also been making quite a few acquaintances and getting a lot of nice and interesting opportunities through the blog. I got offers to write a book, write an article, blog on some other sites, and more.

I am looking for guest bloggers to this blog. If you or someone else is interested then please let me know

a) a bit about yourself (along with LinkedIn profile)
b) topics you are interested in writing about on this blog
c) references to articles written in the past if any

I don't want to put a lot of restrictions around this, but here are a few

a) the article should be authentic
b) no affiliate or promotional links to be included
c) the article can appear elsewhere after 10 days with a back link to the original

I am open to any topics around Big Data, but here are some of the topics I would be interested in

a) a use case on how your company/startup is using Big Data
b) using R/Python/Mahout/Weka for some interesting data processing
c) integrating different open source frameworks
d) comparing different open source frameworks with similar functionalities
e) ideas and implementation of pet projects or POC (Proof Of Concepts)
f) best practices and recommendations
g) views/opinions on different open source frameworks

As a bonus, if a blog gets posted here then it will also include a brief introduction about the author and a link to his/her LinkedIn profile. This will give enough publicity for the author.

If you are a rookie and writing for the first time, that shouldn't be a problem. Everything begins with a simple start. Please let me know at if you are interested in blogging here.

Tuesday, April 29, 2014

Wanted interns to work on Big Data Technologies

We are looking for interns to work with us on some of the Big Data technologies at Hyderabad, India. The pay would be appropriate. The intern should preferably be from a Computer Science background, be really passionate about learning new technologies and be ready to stretch a bit. Under our guidance, the intern would be installing, configuring and tuning everything from the Linux OS all the way up to Hadoop and the related Big Data frameworks on the cluster. Once the Hadoop cluster has been set up, we have a couple of ideas which we would be implementing on the same cluster.
The immediate advantage is that the intern would be working on one of the current hot technologies and would have direct access to us to learn more about Big Data. Also, based on the requirement, appropriate training would be given around Big Data. The work being done by the intern will also help them in getting through the different Cloudera Certifications.

BTW, we are looking for someone who can work with us full time and not part time. If you or anyone you know is interested in taking an internship please send an email with CV at

Monday, April 28, 2014

Big Data Webinars

Big Data is a moving target, with new companies, frameworks and features getting introduced all the time. It's getting more and more difficult to keep pace with Big Data. There is nothing more relaxing than sitting in a chair and watching webinars on some of the latest technologies.
Brighton Beach by appoose81 from  Flickr under CC
So, here (1) is a calendar (XML, ICAL, HTML) with a few of the upcoming webinars around Big Data. I will be adding more and more events to the calendar as they get planned. Those interested can import the calendar into Thunderbird, Outlook or some other calendar application and stay updated on the webinars in the Big Data space.

If you are interested in including any webinar around Big Data in the calendar, then let me know at

Saturday, April 26, 2014

Automating things using IFTTT

I am a big fan of automation and recently was looking for a way to automatically tweet when I publish a new blog. I found IFTTT, which allows creating recipes like these and sharing them with others. The recipes run every 15 minutes, so there is a delay of at most 15 minutes between publishing a new blog and getting it posted to Twitter.
The acronym IFTTT is a bit cryptic to remember and expands to `IF This Then That`. The service has been running for almost 4 years, but it has been a bit flaky when I used it.

I was not able to create recipes at first and was also not able to add LinkedIn as a channel. Also, it's not possible to create multiple channels of the same type. For example, the new blog event cannot be sent to multiple Twitter accounts. I had to create multiple accounts with IFTTT so as to add multiple Twitter channels. Creating multiple triggers for a single recipe is also not possible for now. It would be nice to have some complex recipes like if-then-else and others.

The recipe has already been created for any new blog posted here to be tweeted. I need to wait and see if this post gets tweeted or not.

Friday, April 25, 2014

Nice video on Machine Learning with Python

When I started looking around Big Data/Hadoop couple of years back, the amount of information to get started was sparse and I had to spend a lot of time to be comfortable with Big Data. Now the problem is that there is too much information/noise and it's taking time to separate the good from the bad. RSS aggregators (1) are making it even more difficult.
Anyway, above is a nice video to get started with Machine Learning and Python. There are a lot of things happening in the R and Python space in the context of Machine Learning, but this session is a bit biased towards Python and covers everything from what Machine Learning is all about to showing Python code for a simple use case. It's worth the time watching the video.

Wednesday, April 23, 2014

Screencast for submitting a job to a cluster

In an earlier blog (1) we looked at how to develop a simple MapReduce program in Eclipse on a Linux machine. Here (1) is another screencast on submitting a word count MapReduce program written in Python. For some reason Windows Media Player is not able to play the file, but VLC is able to.

Here is the code for the mapper (1) and the reducer (1). Note that Hadoop provides the Streaming (1) feature for writing MapReduce programs in non-Java languages.
As observed in the screencast the Virtual Machine (VM) has all the necessary pieces to get easily started with Big Data and is provided as part of the Big Data training (1). The VM is updated regularly to add new frameworks and update the existing ones.

Hadoop tries to hide the underlying infrastructure details, so the MapReduce code and the command to submit a job are the same for a single node and a thousand node cluster.
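
For readers who cannot open the links, here is a minimal sketch of what a Streaming-style word count looks like in Python. It is not the code from the screencast. The function names and the sample text are made up, and the map, sort and reduce steps are simulated locally in a single process.

```python
import io
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_records):
    """Sum the counts of consecutive records that share a key.

    Hadoop Streaming sorts the mapper output by key before the reducer
    sees it, so equal words arrive next to each other.
    """
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local simulation of map -> sort (shuffle) -> reduce.
text = io.StringIO("the quick brown fox\nthe lazy dog\n")
result = list(reducer(sorted(mapper(text))))
```

In a real Streaming job the mapper and the reducer would each live in their own script, reading from `sys.stdin` and printing to standard output, while Hadoop takes care of the sort in between.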

Tuesday, April 22, 2014

Screencast for developing MapReduce programs in Eclipse

In one of the earlier blogs (1), we looked at how to develop MapReduce programs in Eclipse. Here (Develop-MR-Program-In-Eclipse.mp4) is the screencast for the same. The screencast has been recorded in the mp4 format using Kazam on Ubuntu 14.04. For some reason Windows Media Player is not able to play the file, but VLC is able to.
The process in the screencast only works on Linux and assumes that Java and Eclipse have already been installed. Here (1, 2) are the instructions if Java is not installed. Also, Hadoop (hadoop-1.2.1.tar.gz) has to be downloaded from here (1) and extracted to a folder. The code for the WordCount MapReduce Java program can either be written by hand or downloaded here (1).

Wednesday, April 16, 2014

Microsoft commoditizing Big Data

Though I had been an Ubuntu buff for quite some time, I have to accept that Microsoft products work out of the box with little tweaking for good or bad. I had to tweak Ubuntu every once in a while to make the work flows faster and quicker for the tasks I had been doing repeatedly. Also, the learning curve is a bit steep for Linux when compared to Windows platform. Things are slowly changing from a Linux perspective, with Ubuntu leading the effort. But, once you get used to Linux, it's very difficult to go back.

With the Microsoft CEO talking about Big Data (1, 2), it's definite that the whole company will rally (embrace and extend) around Big Data wholeheartedly. Microsoft will repeat history with Big Data as done with their other products (commoditized, buggy and easy to use).

Installing/configuring/integrating Big Data products is still a pain, despite the various commercial companies working around Big Data. Getting Microsoft into the Big Data space will keep the different Big Data vendors on their toes and we can expect much more innovation in the usability space.
More lego dudes! by Sappymoosetree from Flickr under CC
Note (18th April, 2014): Here is an article from readwrite resonating the same.

Saturday, April 12, 2014

Heartbleed explanation as a comic

For the impatient and for those less into technology, here is a nice comic on Heartbleed (1, 2). As seen in the 5th box below, the server returns some additional data from the memory which might contain private keys, passwords, logs etc because the number of letters sent (in HAT) doesn't match with the number (500) which the client claims to send.
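
The missing length check can be illustrated with a toy model in Python. This is not the OpenSSL code, only a sketch of the idea: the server trusts the length claimed by the client instead of the size of what was actually sent. The memory contents (user name, password, key) are made up.

```python
# Toy model of server memory: the heartbeat payload sits next to unrelated secrets.
MEMORY = bytearray(b"HAT" + b"|user=alice;password=hunter2;key=PRIVATE|")

def heartbeat_vulnerable(claimed_length):
    """Echo back `claimed_length` bytes without checking the real payload size."""
    return bytes(MEMORY[:claimed_length])

def heartbeat_fixed(payload, claimed_length):
    """Drop the request when the claimed length exceeds the actual payload."""
    if claimed_length > len(payload):
        return b""
    return payload[:claimed_length]

leak = heartbeat_vulnerable(20)     # client sent 3 letters but claimed 20
safe = heartbeat_fixed(b"HAT", 20)  # the bounds check rejects the request
```

The vulnerable version hands back the three letters plus whatever happens to follow them in memory, while the checked version returns nothing for the same request.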

Friday, April 11, 2014

Updates on Spark

In the earlier blog `Is it Spark 'vs' OR 'and' Hadoop?`, we looked at Spark and Hadoop at a very high level. MapReduce is batch oriented in nature, so is not well suited for some types of jobs. Spark is well suited for iterative processing as well as for interactive analysis.

For any new framework to be adopted within an enterprise, it's very important that commercial support be available for the framework unless there is enough expertise within the enterprise to know the internal details on how a particular framework works and be able to support it. Cloudera formed a partnership with Databricks (1) in October 2013 and has included Spark (2) in the CDH distribution. Here (3) is Cloudera's vision on Spark. Cloudera put it succinctly by mentioning

The leading candidate for “successor to MapReduce” today is Apache Spark. Like MapReduce, it is a general-purpose engine, but it is designed to run many more workloads, and to do so much faster than the older system.

Now it's MapR's turn to form a partnership with Databricks (4) and include Spark in the MapR distributions (5). There is a lot of traction around Spark, and more and more inclusions of Spark in the commercial distributions will make the foundation of Spark more solid. Spark can be installed/configured on the HortonWorks distribution manually (6, 7), but there is no partnership between Databricks and HortonWorks as with Cloudera and MapR.
It's not that Hadoop is going away; Hadoop 2.x will be used as the underlying platform for storing huge amounts of data through HDFS and also for running different computing models using YARN (8).

One interesting fact from the Apache blog (9) is that Spark was created in 2009 at the University of California at Berkeley's AMPLab. So, it took almost 5 years to get to the current state.

Spark has some of the best documentation (10) around open source. Fast Data Processing with Spark is the only book around Spark as of now and covers installing and using Spark. Also, it looks like the book Learning Spark: Lightning-fast big data analytics is still a work-in-progress and will be released at the end of 2014.

Note (2nd May, 2014) : Now Hortonworks is also including Spark in HDP. More here.

Wednesday, April 9, 2014

Big Data and Security

Some time back I accidentally came across a security blog by Brian Krebs, called Krebs on Security. Recently BusinessWeek published an article on him and Sony plans to make a movie of the same. He was the one who broke the news of the Target security breach here. Once I started following his blog, I began thinking twice before spending online using my Credit Card.
Security Circus by Alexandre Dulaunoy from Flickr under CC
So, why a discussion about security all of a sudden? With Big Data tools like Hadoop getting commoditized, enterprises have the option to store more and more data with less thought about the implications. The more data is stored, the more we have to think about authentication, authorization and auditing.

If someone follows the Big Data space closely, new features are added to the existing frameworks and new frameworks are introduced very often, but very few of them revolve around security. The awareness around Big Data security is also low. The Big Data vendors should take up the additional responsibility of educating end users about security by publishing more blogs, articles, webinars, code examples and whatever other means possible.

In an upcoming blog, I will blog about the current state of the ecosystem around Big Data security. To quote Uncle Ben `With great power comes great responsibility` and we need to be more and more responsible around data.

Note (19th April, 2014) : Here is a short interview of Brian Krebs on CNN.

Monday, April 7, 2014

Intel and Cloudera Partnership

Intel invests (1, 2) $740 million in Cloudera, taking an 18% stake in it. Intel would be dropping its distribution of Hadoop in favor of CDH, and the CDH bits will be more and more optimized for Intel processors. Not sure at what level in the technology stack the optimization would be done, as most of the Big Data frameworks have been developed in Java except for a few like Impala. The Intel Hadoop site still points to the `Intel Distribution for Apache Hadoop Software` and not to the distribution from Cloudera.

This partnership makes sense from Intel's perspective as more and more data centers are experimenting with AMD processors. From Cloudera's perspective, they would be getting a new distribution channel.

There have been a lot of companies around Big Data because of the hype and the open source nature of the Big Data tools. It's good to see consolidation happening around these companies, with the big companies acquiring or forming partnerships with the small companies. Consolidation helps to weed out the weaker companies, leaving the stronger ones.
Shaking Hands by zeevveez from Flickr under CC

Thursday, April 3, 2014

Oozie High Availability

In the earlier blog entries, we looked at how to install/configure Oozie, create and submit a simple work flow and finally execute the work flow at regular intervals of time.
Oozie work flows are written in hPDL (Hadoop Process Definition Language), either using Hue or with something as simple as a notepad. Note that writing hPDL by hand is not easy, so using Hue would be the easiest approach as it automatically generates the hPDL XML code. The Oozie Client submits the work flow definition to the Oozie Server, which in turn starts the different actions as defined in the work flow definition.

As seen in the above diagram, the Oozie Server is a single point of failure. Oozie now supports HA (active-active Oozie Server) and the feature has been included in CDH 5. Here are the instructions for configuring Oozie in HA mode and here are more details about the Oozie HA feature from Cloudera.

Monday, March 31, 2014

What is a Big Data cluster?

Very often I get the query `What is a cluster?` when discussing Hadoop and Big Data. To keep it simple, `A cluster is a group or a network of machines wired together acting as a single entity to work on a task which would take much longer when run on a single machine.` The given task is split and processed by multiple machines in parallel so that it gets completed faster. Jesse Johnson puts in simple and clear terms what a cluster is all about and how to design distributed algorithms here.
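
The split, process and merge shape of such a task can be sketched on a single machine in Python. In this illustration worker threads stand in for the machines of a cluster, so it shows the structure of the computation and not a real speedup. The sample lines and function names are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(chunk):
    """Work done by one 'node': count the words in its share of the lines."""
    return Counter(word for line in chunk for word in line.split())

def split(lines, parts):
    """Deal the lines out into `parts` roughly equal chunks."""
    return [lines[i::parts] for i in range(parts)]

lines = ["big data cluster", "data node", "cluster of nodes", "big task"]

# Each worker thread stands in for one machine of the cluster.
with ThreadPoolExecutor(max_workers=2) as pool:
    partial_counts = list(pool.map(count_words, split(lines, 2)))

# Merge the partial results into the final answer.
total = sum(partial_counts, Counter())
```

The merged result is the same as counting everything in one place. Only the work in the middle has been divided.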
IMG_9370 by NeoSpire from Flickr under CC
In a Big Data cluster, the machines (or nodes) are neither as powerful as a server grade machine nor as dumb as a desktop machine. Having multiple (like in thousands) server grade machines doesn't make sense from a cost perspective, while a Desktop grade machine fails often which has to be appropriately handled. Big Data clusters have a collection of commodity machines which fall in between a server and a desktop grade machine.

Similar to open source software projects like Hadoop and others, Facebook started the Open Compute Project around computing infrastructure. Facebook doesn't see any edge over its competitors in having specialized hardware distinct from the rest, and has been opening some of its internal infrastructure designs. Anyone can take a design, modify it and come up with their own hardware.

I am not much into hardware, but it makes sense for the different data centers (like those from Amazon, Google, Microsoft and others) to have a common specification around hardware, as it brings down the data center building cost due to the scale of manufacturing and the shared R&D costs. It's very similar to what has been happening in the Apache and the Linux space, where different companies work together in a collaborative environment on a common goal to make software better and enjoy the benefits of the same.

Note (21st April, 2014) : Here is a nice article from ZDNet on how Facebook saved $$$ using the Open Compute Project.

Lipstick on Pig for monitoring and visualizing jobs

As mentioned in the previous blog, Pig and Hive are higher level abstractions on top of MapReduce. Given a task like joining two data sets, it's much easier to do it using Pig or Hive as it takes less coding effort when compared to MapReduce. So, many companies are going with Pig and Hive as they provide better developer productivity.
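
To get a feel for the coding effort, here is a rough Python sketch of a reduce-side join written in the MapReduce style, simulated in a single process. The data sets and function names are made up, and a real MR job would need driver and job configuration code on top of this. In Pig or Hive the same result is a single JOIN statement.

```python
from collections import defaultdict

def map_phase(users, orders):
    """Tag every record with its source and emit it under the join key."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, item in orders:
        yield user_id, ("order", item)

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Pair each user name with every order that shares the key (inner join)."""
    for key in sorted(groups):
        names = [v for tag, v in groups[key] if tag == "user"]
        items = [v for tag, v in groups[key] if tag == "order"]
        for name in names:
            for item in items:
                yield key, name, item

users = [(1, "asha"), (2, "ravi")]
orders = [(1, "phone"), (1, "case"), (3, "tablet")]
joined = list(reduce_phase(shuffle(map_phase(users, orders))))
```

Only key 1 appears in both data sets, so the join keeps the two orders of that user and drops the unmatched records.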

The problem with abstraction is that it gives less control on what can and cannot be done and debugging with higher abstraction is also difficult as it hides the underlying details. Same is the case with Pig and Hive also.

Some time back Netflix open sourced Lipstick. Google also recently published a blog entry around the same. Pig converts the PigLatin scripts into a DAG of MapReduce jobs and the underlying MR data flows can be difficult to visualize. Lipstick enables developers to visualize and monitor the execution of the Pig data flows at a logical level (aka MR). Earlier, this had to be done using the log files or by looking at the MR Web console.
urban decay lipstick naked2 3 from Flickr by ldhren under CC
Netflix and Twitter have been very aggressive in open sourcing their internal projects. With so much choice around, there has not been a better time in software to take an idea from concept to realization. One of the main criteria for picking a framework or a piece of software is the support provided by commercial vendors. A good percentage of the software around Big Data is free and can be put in production with minimal cost, but lacks the commercial support that helps keep downtime low. Lipstick also falls under the same category. It has not been included in any of the commercial Big Data distributions like the ones from Cloudera, Hortonworks, MapR and others. So, Lipstick has to be installed manually and patching (for any bugs/improvements) has to be taken care of by the end user.

In an upcoming blog, we will look into how to install and configure Lipstick on top of Pig.

Friday, March 28, 2014

Mahout and MR

There has been an active discussion (1, 2, 3) in the Mahout Dev mailing list about the goals for Mahout 1.0 and also about moving the underlying computation engine from MR to Spark or H2O. But as mentioned in the GigaOM article `Apache Mahout, Hadoop’s original machine learning project, is moving on from MapReduce`, the community hasn't decided yet.

As mentioned in the earlier blogs here, MR is by default batch oriented in nature and is also not suited for iterative processing and implementing Machine Learning algorithms as processing with MR involves R/Ws to HDFS after each step in the iteration. Mahout is pretty much tied to MR, though it's not impossible to rewrite the underlying MR algorithms, it's also not an easy task. It would be the right direction for the Mahout project to move to some non-MR platform and the sooner the better.

With the announcement of Oryx from Cloudera, we can expect quick progress around the distributed Machine Learning frameworks.
Directions by MShades From Flickr under CC

Friday, March 7, 2014

Is it Spark 'vs' OR 'and' Hadoop?

There has been a lot of traction around Spark. Cloudera announced (1, 2) it being a part of the CDH distribution and here is their stance on `MR and Spark`. MapReduce is a programming model for distributed computing while Spark is a framework or a piece of software. The essence of the Cloudera article is accurate, but the blog title is a bit misleading. It should be Hadoop and Spark.

Here is another interesting article from Cloudera on Spark from a Data Science perspective. Mahout implements Machine Learning algorithms in an MR fashion. But, MR is batch oriented (high latency and high throughput) in nature and isn't well suited for Machine Learning algorithms which are iterative in nature. There is an active discussion in the Mahout community about decoupling the ML algorithms from MR where possible. It is a long shot, but the sooner the effort the better as alternate distributed ML frameworks are cropping up.

With Spark becoming a top level project and getting a lot of attention, one might wonder what will happen to Hadoop now that Spark is gaining all the attention. So, this blog is all about it.
But, before jumping into Spark, let's look at how Hadoop has evolved. Hadoop 1 has two components, HDFS (for storing the data at large scale) and MR (for processing the data at large scale), which operate on a group of commodity machines and act like a single cluster (or an entity) to work on a very big task. In Hadoop 2 (aka YARN, Next Generation Architecture, MRv2), Hadoop additionally includes YARN, which has a centralized Resource Manager that allows multiple computing models (powered by YARN) to run on the same cluster. Executing multiple models on the same cluster will increase the utilization of the cluster and mostly flatten its usage.

Along the same lines, Spark applications can also run on a resource/cluster management framework like YARN or Apache Mesos. So, it's not a matter of making a choice between Spark and Hadoop; Spark can run on top of YARN (Hadoop 2) and can also use HDFS as a source for data. More on the different Spark execution models from the official documentation here.
With so much traction around Spark, it's interesting to see how work loads get moved from some of the other frameworks like Hadoop (MR) to Spark. Note that MR is a distributed computing model and can be implemented on a distributed computing framework like Spark. Here is a simple Scala program for WordCount on top of Spark.
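// `spark` is assumed to be an existing SparkContext
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))  // split each line into words
    .map(word => (word, 1))                         // pair every word with a count of 1
    .reduceByKey(_ + _)                             // add up the counts per word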
With so much happening in the Big Data space, I will try to keep this blog updated with some of the latest happenings.

Monday, February 10, 2014

Review of Learning Python (5th Edition)

Thanks to Vincent Danen for the picture. `A picture is worth a thousand words`. The book in the foreground is the 3rd edition of `Learning Python` and in the back is the 5th edition. The book has been getting thicker edition by edition. I took a quick look on Amazon and the below table gives the size of each edition and the published date. Maybe, we can use non-linear regression to figure out the size of the next edition :)
Kidding aside, the author (Mark Lutz) of Learning Python - 5th edition does a very good job introducing Python and slowly moving into some of the advanced topics. But, the only gripe I have is that the book is huge (cannot be carried easily) and that the author repeats some of the topics again and again. So, if you are a quick reader like me, then you can quickly skip some of the repeated content and focus more time on the topics of interest.

For those who are into Python for some quick results, this book is certainly not an option. But, if you are into Python for a long haul for using it with Data Science or something else, then the book is worth the time. The book is about Python in general, so can be applied to other areas in Python (Scripting, Data Science, Dynamic Pages etc). The author also mentions the Python ecosystem at a very high level, so this book also gives a 360 view of Python.

Once you are familiar and comfortable with Python, the same author has also published Python Pocket Reference and Programming Python (on how to develop applications in Python).

Friday, February 7, 2014

Big Data Scenarios / Use Cases

Very often I do get the query `I know Hadoop, Hive, Pig etc. Where do I start using it?`. One quick way is to figure out what others had been doing in the domain of interest. This can be done in multiple ways:

1) Follow some of the Big Data related blogs like the ones from Cloudera, HortonWorks etc.  Some of the blogs do a good job of segregating blog entries into different categories as in the case of Cloudera.

2) Follow a Big Data aggregator like Planet Big Data and Big Data Made Simple. They don't have any original content, but act like a mere aggregator from various other places.

3) Kaggle has been the platform for holding Data Science competitions. Look at the description of the different challenges and also don't forget to follow their blog.
What's Next ? by Crystl from Flickr under CC

Thursday, February 6, 2014

Optical Archival Storage Technology in Facebook

Verbatim 5.25" floppy disk by goosmurf from Flickr under CC
With so much happening around Big Data, it is interesting to know some of the happenings in the storage space even for those who are not much into hardware. Here is an interesting perspective from James Hamilton on how Facebook uses optical technology for cold storage (aka archival).

This Facebook hardware project is particularly interesting in that it’s based upon an optical media rather than tape. Tape economics come from a combination of very low cost media combined with only a small number of fairly expensive drives. The tape is moved back and forth between storage slots and the drives when needed by robots. Facebook is taking the same basic approach of using robotic systems to allow a small number of drives to support a large media pool. But, rather than using tape, they are leveraging the high volume Blu-ray disk market with the volume economics driven by consumer media applications. Expect to see over a Petabyte of Blu-ray disks supplied by a Japanese media manufacturer housed in a rack built by a robotic systems supplier.

Here is a video from Facebook showing the actual hardware and an article from Arstechnica. Finally, below is a video (around 30 minutes) with Facebook VP Jay Parikh discussing cold storage and Blu-rays.

Tuesday, February 4, 2014

Free Python books

Lately I have been drumming about Python for Data Science and have been spending time learning the same. Many of us might have used Python for creating dynamic web pages, server side automation and other general purpose requirements. So, using Python for Data Science would be a natural extension. Here are a lot of free Python books from freepythontips.
Books by Gavin Gilmour from Flickr under CC

Monday, February 3, 2014

Introduction to Spark

In an earlier blog we looked at RDDs, which form the basis for Spark. I planned to write in detail about Spark, but DBMS2 does a very good job of summarizing `Spark and Databricks` here. Databricks, which is in stealth mode, would mostly be providing services (in the cloud) and commercial support around Spark. Cloudera is actively pushing (1, 2) Spark and sooner or later we would see Spark in CDH.

Spark and Machine Learning are a nice combination. MapReduce processing has high latency and high throughput and is not well suited for ML processing, which is iterative in nature. R and Python (1, 2) interfaces to Spark are still a work-in-progress. So, over time it should be possible to use the rich ML/Statistical libraries of R/Python with Spark.

Btw, here is the original paper on Spark.

Friday, January 31, 2014

Bloom Filters in HBase and Chrome

Bloom Filters allow us to efficiently check whether a particular element/record is in a set/table or not. They have very minimal impact on insert operations. The only caveat is that a Bloom filter might return a false positive: it might say that a particular element/record is in the set/table even when it is not. It never returns a false negative, though. Bloom Filters have been implemented in HBase and are enabled by default.
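
To make the idea concrete, here is a minimal sketch of a Bloom filter in Python. It is only an illustration and not how HBase or Chrome implement it: the `BloomFilter` class, its `num_bits` / `num_hashes` parameters and the `row-N` keys are names assumed for this example, and the hash positions are derived by salting SHA-256 with an index.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: a bit array plus k hash functions (illustrative only)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k bit positions by salting a single hash function with an index.
        for i in range(self.num_hashes):
            salted = ("%d:%s" % (i, item)).encode("utf-8")
            yield int(hashlib.sha256(salted).hexdigest(), 16) % self.num_bits

    def add(self, item):
        # Inserting only sets a few bits, which is why inserts stay cheap.
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for key in ["row-1", "row-2", "row-3"]:
    bf.add(key)

print(bf.might_contain("row-2"))   # True: an added item is never missed
print(bf.might_contain("row-99"))  # usually False, but a false positive is possible
```

Only when `might_contain` returns True does the caller need to do the expensive lookup, which is the pattern described for both HBase and Chrome.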

It is interesting to know that the Google Chrome browser used to implement Bloom Filters, which were later replaced with an alternate approach. According to Wikipedia (outdated):

The Google Chrome web browser uses a Bloom filter to identify malicious URLs. Any URL is first checked against a local Bloom filter and only upon a hit a full check of the URL is performed.

HBase implements Bloom filters on the server side, while Chrome implements the Bloom filter on the client/browser side. So, Bloom filter data is up to date in HBase, but might be a bit outdated in the Chrome browser. There are some variations of Bloom Filters, but the basic concept is very simple and beautiful.
Blooming Georgia St. from Flickr by Tiberiu Ana under CC

Thursday, January 30, 2014

Hadoop in a box (revisit)

It's possible to set up a small Hadoop cluster on a single machine using virtualization. I blogged about it here a year ago without the nitty gritty details, but Cloudera published a blog here with many more details.
Here is an alternate way of setting up a bigger cluster on a single machine using Linux containers. This solution only works with Linux as the host OS and puts less burden on the host, while the approach described by Cloudera works on multiple host OSes. I haven't given the Linux containers option a try, but it seems very interesting.

The same concept can be extended for non Hadoop frameworks also.

`Technology Radar` 2014 from ThoughtWorks

ThoughtWorks has released the `Technology Radar` for 2014. They have categorized techniques / tools / platforms / languages / frameworks into different rings (adopt / trial / assess / hold). The rings summarize how ready the different aspects of technology are for adoption. The paper is a good way to pick up some of the hip terms around technology.
Reggae - Rare Music Vided by raremusicvideo1 from Flickr under CC

Resources to get started with Machine Learning

With all the action happening around Machine Learning and Artificial Intelligence (1, 2, 3, 4 etc), it looks like this year we would be seeing more and more interesting things happening around Big Data. One of the areas where we can expect a lot of work in the near future is usability. Some companies like BigML are already working in this space.

ML has been pushed along by the availability of infrastructure at lower costs over the last few years. Amazon has cut the prices 40 times since the launch of AWS. Facebook is using Blu-ray disks to bring down the cost of storing cold data.

As I continue my journey through learning Machine Learning and Statistical Process, I will keep the Machine Learning page (also available from the top tabs) updated with some of the best resources available (free and commercial). As of now I have seeded it with some videos, books, blogs, data sets and use cases. It's nice that the sessions from top universities are being provided for free. Some of these sessions overlap, but they provide different perspectives on the various aspects of Machine Learning.

With so much information available, it's difficult to focus and to differentiate the best from the others. The ML page is not a comprehensive/perfect/exhaustive list, but it is something to begin with and I will keep it updated as I come across more interesting and useful resources.

If you have any interesting/useful resources to be included in the above mentioned page, send them across to
Machine Learning by Erik Charlton on Flickr under CC

Wednesday, January 29, 2014

Interview with John Chambers (creator of S)

Here is an interview with John Chambers (creator of S, on the left) about the history behind R & S. S is considered the predecessor of R. He is also a core member of the R team. The same is also summarized in a pdf here.

He is very passionate about what he has been working on, and even remembers the exact dates of some of the events which happened some 30 years back.

Tuesday, January 28, 2014

Man behind DeepMind

Some interesting (inspiring is a more appropriate word) facts about Demis Hassabis, the man behind DeepMind Technologies. DeepMind is the AI company recently acquired by Google for a cool $400M.

Hats off.

Monday, January 27, 2014

Subscribe by email to

One way to keep up to date with Big Data and any other technology is to closely follow what others have been doing and saying through blogs (both company and individual). Naturally a company's blog usually puts a positive spin on its services and products, but some of the individual blogs are unbiased in what they write. With so much happening around Big Data, it's not possible to follow the huge number of blogs. One way to get around this is to use an RSS aggregator like Feedly. In spite of RSS aggregators being easy to use and free, their usage has been very limited. Google discontinued Google Reader in mid-2013, and this gave an opportunity for alternate RSS aggregators to spring up.

For those who have been following this blog, you can now subscribe to it by providing your email address on the top right. This way you can be sure that you won't be missing some of the latest happenings around Big Data.

There are claims that email is dead with the advent of social media, but most of us still use it. I strongly believe that both email and social media have their own space. The same goes for blogging: people claim that blogging is dead with the advent of social media.
Email email email by RambergMediaImages on Flickr under CC

Sunday, January 26, 2014

Statistical Learning from Stanford Online

Stanford is offering a MOOC on Statistical Learning. It's free, more details here. The course started a couple of days back, but you can still watch the archives once registered for the course. The Coursera ML course is really good, but sometimes it dives deep into Maths, which makes it not for everyone. The good thing about the Stanford offering is:

This is not a math-heavy class, so we try and describe the methods without heavy reliance on formulas and complex mathematics.

On the other side, the focus is on R for statistical analysis. Those who are into Python can still go through the course. DataRobot is planning follow-on blogs on the techniques discussed during the Stanford sessions using Python. DataRobot is still in stealth mode, so you can follow their blog here using Feedly or some other RSS aggregator.

Note that the corresponding book `An Introduction to Statistical Learning with Applications in R` for the course can be downloaded here for free.
Free sign by Alan O'Rourke on Flickr under CC

Friday, January 24, 2014

Resilient Distributed Datasets (RDD) for the impatient

The Big Data revolution was started by Google's paper on MapReduce (MR). But, the MR model mainly suits batch oriented processing of the data, and some of the other models are being shoehorned into it because of the prevalence of Hadoop and the attention/support it gets. With the introduction of YARN in Hadoop, other models besides MR could be first-class citizens in the Hadoop space.

Let's take the case of MR as shown below. There is a lot of reading from and writing to the disk after each MR transformation, which makes it too slow and less suitable for iterative processing, as in the case of Machine Learning.
Let's see how we can solve the iterative problem with Apache Spark. Spark is built using Scala around the concept of Resilient Distributed Datasets (RDD) and provides actions / transformations on top of RDDs. It has some of the best documentation among open source projects. There are not many resources around RDD, but this paper and presentation are the roots of RDD. Check this to know more about RDDs from a Spark perspective.

Let's look at a very high level at what RDDs are and use this as a foundation to build upon Spark and other related frameworks in the upcoming blog articles. According to the earlier mentioned paper:

Formally, an RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either (1) data in stable storage or (2) other RDDs.

This is the essence of what an RDD is. RDDs are an 'immutable resilient distributed collection of records' which can be stored in volatile memory or in persistent storage (HDFS, HBase etc) and can be converted into another RDD through transformations. An action like count can also be applied on an RDD.
As observed in the above flow, the data flow from one iteration to another happens through memory and doesn't touch the disk (except for RDD2). When the memory is not sufficient for the data to fit in, it can either be spilled to the drive or just be left to be recreated upon request.

Because of the distributed nature of Big Data processing, there is a good chance that a couple of nodes might go down at any point of time. Note that in the above flow, RDD2 is persisted to disk because of checkpointing. In the work flow, for any failure during the transformations t3 or t4, the entire work flow need not be played back because RDD2 is persisted to disk. It would be enough if transformations t3 and t4 are played back.

Also, an RDD can be cached in memory for frequently accessed data. Let's say different queries are run on the same set of data again and again; this particular data can be kept in memory for better execution times.
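
The ideas above (immutable datasets, lazy transformations, lineage and caching/checkpointing) can be illustrated with a toy sketch in plain Python. This is not Spark's API: `ToyRDD`, its `cache` method and the `computations` counter are hypothetical names made up for this example, everything runs on a single machine, and one `cache` call stands in for both in-memory caching and checkpointing to disk.

```python
class ToyRDD:
    """Immutable dataset that remembers how it was derived (its lineage)."""

    def __init__(self, source=None, parent=None, fn=None):
        self._source = source    # data in "stable storage" (a plain sequence here)
        self._parent = parent    # the RDD this one was derived from
        self._fn = fn            # transformation applied to the parent's records
        self._cached = None      # materialized records, if cached
        self.computations = 0    # how many times this RDD was (re)computed

    def map(self, f):
        # A transformation returns a new RDD; nothing is computed yet.
        return ToyRDD(parent=self, fn=lambda records: [f(r) for r in records])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda records: [r for r in records if pred(r)])

    def cache(self):
        # Materialize once so later replays can stop at this RDD.
        self._cached = self._compute()
        return self

    def _compute(self):
        if self._cached is not None:
            return self._cached          # no need to walk further up the lineage
        self.computations += 1
        if self._parent is None:
            return list(self._source)
        return self._fn(self._parent._compute())

    def collect(self):
        # An action: triggers evaluation of the lineage.
        return self._compute()

    def count(self):
        return len(self._compute())


rdd1 = ToyRDD(source=range(1, 11))
rdd2 = rdd1.map(lambda x: x * x).cache()    # plays the role of the persisted RDD2
rdd3 = rdd2.filter(lambda x: x % 2 == 0)    # like transformation t3
rdd4 = rdd3.map(lambda x: x + 1)            # like transformation t4

print(rdd4.collect())       # [5, 17, 37, 65, 101]
print(rdd4.count())         # 5
print(rdd1.computations)    # 1: later actions never go back past rdd2
```

Each action replays only the transformations after the cached RDD, which mirrors the point that a failure during t3 or t4 does not require playing back the entire work flow.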

To summarize, for iterative processing the MR model is less suited than the RDD model. Performance metrics for iterative and other kinds of processing are mentioned in detail in this paper on RDD.

Safari Online All-Access Subscription

Gamler's Arrow by Seema Krishnakumar on Flickr
Big data is like a moving target and a month doesn't go by without a new book / framework / company / VC funding etc around Big Data. So, I finally jumped into the Safari Online Subscription (Individual) to get myself up to speed.

I was under the impression they would be offering only O'Reilly publication books, but there are books from a lot of other publishers also as shown here. They have a huge collection of books and videos which can be searched easily using the topic of interest. One of the videos I marked is Hilary Mason: Advanced Machine Learning.
There are multiple subscriptions as shown below with Safari Online. The main advantage of the `ALL-ACCESS` subscription is that it gives access to Rough Cuts, which are work-in-progress books not yet published. So, I am able to access Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 and other books around some of the latest technologies.
My only gripe was that there is no offline support, so I need to be connected all the time to read the books. It looks like limited offline support is provided, more details here. Also, multiple books come up when searching for a particular topic and there is no user rating to decide which book to go with.

I would definitely recommend this subscription for those who are planning to get into or are already deep into Big Data.

Monday, January 20, 2014

Review of `Building Machine Learning Systems with Python` Book

For some time I wanted to get started with Machine Learning (ML) and was looking for some good resources (books, videos, articles etc). Some of them dive deep into the ML topics (with too much stats and maths) without any practical aspects, and some of them have only a minimum of practical examples. But, I wanted to get started with the practical aspects of ML quickly without diving deep into the ML concepts, which I plan to pursue later.

It's like I don't want to know in detail how a car works, but still I would like to drive a car. Although knowing how a car works will really help when there is a breakdown in the middle of nowhere. Applying the same analogy, there are a lot of frameworks in different languages which implement the ML algorithms, and it's a matter of knowing which ML algorithm to use for a given problem and calling the appropriate framework API. Later at some point of time, we might be interested in the nitty gritty details of how a particular ML algorithm works or is implemented in order to fine tune it, but it would be good to get some quick results without getting into the minute details.

`Building Machine Learning Systems with Python` maintains a perfect balance between the theoretical and practical aspects of ML. It goes with the assumption that the reader is familiar (not an expert) with Python and ML. For getting started/familiar with ML I would recommend `The Shape Of Data`, and for Python I would recommend 1 and 2. There are a lot of editors for Python, but I have been using the Eclipse PyDev plugin. For those who are familiar with the Eclipse environment, it would be very easy to get started with PyDev.

Just to add, there are a lot of ML frameworks for Python (1, 2, 3, 4 etc). But, as of this writing I couldn't find any Python framework which implements ML algorithms in a distributed fashion. I have posted a query on SO here and am waiting for a response. I am not exactly sure if Python is a good option for distributed ML, but some of the options for distributed ML around Java are Apache Mahout and the recent Oryx from Cloudera. But, as the size of the data sets grows, it makes sense to have some nice frameworks implementing ML algorithms in a distributed fashion using Python. Here is an article on how to use Mahout with Python, but both wrappers and native interfaces have their own space. (scikit-learn support is being added to Hadoop, more details here).

As mentioned in the `O'Reilly Data Science Salary` there is a close tie between Python and R around Data Science. Here is an interesting comparison from Revolution Analytics between the usage of Python and R. Revolution Analytics, like any other company, will naturally speak well of its own products, but they have got some metrics to back it up.

I am going through the above mentioned book and will be writing a detailed review of it. I am really excited to get the book and couldn't stop writing about it. I will also be blogging about my experience with Python and ML in particular. So, keep following.
6 24 09 Bearman Cartoon Artificial Intelligence copy by Bearman2007 on Flickr

Thursday, January 16, 2014

2013 Data Science Salary Survey from O'Reilly

The `2013 Data Science Salary Survey` just landed in my mail box. The executive summary on `Page 8` is really interesting. Not sure how big the sample is, but the report shows some interesting correlations. SQL/RDB tops the list of data analytics tools, followed by R and Python which are very close.
Money by Philip Taylor PT on Flickr

Wednesday, January 15, 2014

Map and Reduce in Python without Hadoop

MapReduce is not a new programming model, but Google's paper on MapReduce made it popular. A map is usually used for transformation, while reduce/fold is used for aggregation. They are built-in primitives in functional programming languages like Lisp and ML. More about the functional programming roots of the MapReduce paradigm can be found in Section 2.1 of the Data-Intensive Text Processing with MapReduce paper.
Below is a simple Python 2 program using the map/reduce functions. map/reduce are functions in the __builtin__ python module. More about functional programming in Python here. For those using Python 3, the reduce function has been removed from the builtins. According to the What's New in Python 3.0 notes:

Removed reduce(). Use functools.reduce() if you really need it; however, 99 percent of the time an explicit for loop is more readable.

The Python 2 program `squares/transforms` a list of 1 to 100 using `map/square` and then `sums/aggregates` them up using the `reduce/add` function. Note that Hadoop which provides a run time environment for executing MapReduce programs also does something similar, but in a distributed fashion to process huge amounts of data.

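# Transformation step: applied to every element by map().
def square(x):
    return x * x

# Aggregation step: folds two values into one for reduce().
def add(x, y):
    return x + y

def main():
    # Python 2: map/reduce are builtins and print is a statement.
    # range(1, 101) covers 1 to 100 inclusive.
    print reduce(add, map(square, range(1, 101)))

if __name__ == "__main__":
    main()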
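
For those on Python 3, a minimal sketch of the same idea could look like the following. It assumes the same `square` and `add` helpers, imports `reduce` from `functools` as the quoted note suggests, and compares the result against the plain loop-style alternative.

```python
from functools import reduce  # reduce is no longer a builtin in Python 3


def square(x):
    return x * x


def add(x, y):
    return x + y


# map() returns a lazy iterator in Python 3; reduce() consumes it.
total = reduce(add, map(square, range(1, 101)))
print(total)  # 338350

# The more explicit alternative: a generator expression with sum().
assert total == sum(x * x for x in range(1, 101))
```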

Tuesday, January 7, 2014

Oozie hangs on a single node for work flow with fork

Apache Oozie is a work flow engine which allows for a DAG of tasks to be run. In earlier blog entries we looked at installing/configuring Oozie, creating/executing simple work flows and finally creating a coordinator/scheduler using Oozie. Azkaban from LinkedIn is similar to Apache Oozie.
For each action (like Hive, Pig, Sqoop etc), Oozie launches an MR launcher which in turn launches the actual action. Here is a discussion from the Apache forums on the rationale behind this.

In the above work flow, after the fork (in green) the Pig and Sqoop actions (in yellow) are launched by two separate MR launchers in parallel. So, at this point of time four Map slots are required (two for the MR launchers, one for Pig and one for Sqoop). But, by default only two Map slots are available in each node. That is, only two Map tasks will run at any point of time in a node. This is specified by the `` property which defaults to 2 in the mapred-site.xml file.

The two available slots are occupied by the Oozie MR launchers and there are no more Map slots available to run the Map tasks related to the Pig and Sqoop actions. In this case the Job Tracker keeps on waiting for the availability of Map slots, which will never happen.
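
The slot arithmetic above can be sketched as a small check in Python. This is a simplified model and not part of Oozie or Hadoop: `workflow_can_progress` is a hypothetical helper, and it assumes that every forked action needs one Map slot for its launcher plus at least one more for the job it starts, and that the launchers get scheduled first.

```python
def workflow_can_progress(map_slots, parallel_actions):
    """Return True if the forked actions can make progress on `map_slots` slots.

    Simplified model (an assumption, not Oozie code): launchers grab slots
    first, and a launcher only finishes after its child job has run.
    """
    launchers_running = min(map_slots, parallel_actions)
    free_slots = map_slots - launchers_running
    # With no free slot left, every launcher waits on a child that can never start.
    return free_slots >= 1


print(workflow_can_progress(2, 2))  # False: both slots are held by launchers
print(workflow_can_progress(4, 2))  # True: two launchers plus Pig and Sqoop fit
```

Under this model a single action on the default two slots still works, which matches the observation that the hang shows up only for work flows with a fork.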

The above mentioned problem happens only when Hadoop is running in the pseudo distributed mode (i.e., single node/machine) with the default configuration, which is usually the case in development. One solution is to increase the number of nodes and configure the Hadoop cluster in a fully distributed mode. Another solution is to bump up the `` property in the mapred-site.xml as below.

Monday, January 6, 2014

Different ways of getting started with Big Data

For those who are getting started with Big Data, there are a multitude of options with their own pros and cons. Without going further in detail about each of them, below is a mind map with some of the options. Based on the budget allocated to the initiative, time-to-market, core expertise of the company, industry regulations and various other factors the appropriate one can be picked.
For example, a startup planning a Big Data initiative is more likely to lean towards a cloud based service to avoid the initial CAPEX and also for the sake of instantaneous scalability. Similarly, a bank is more likely to go for an on-premise commercial distribution (like CDH from Cloudera, HDP from Hortonworks), so that it can concentrate on its line of business rather than on the issues that arise with the Big Data environment, and also because of industry regulations.
{pick*me*2} by { lillith } from Flickr under CC
For those who are following the Big Data space, a week never goes by without the mention of a new startup / framework / funding. It will be some time before we see consolidation, with only the strong players left.


Thursday, January 2, 2014

Getting started with Big Data - Part 3 - Installation and configuration of Apache Bigtop
In earlier blog entries we looked at how to install VirtualBox and then how to install Ubuntu on top of VirtualBox. In this final part of the series, we will look at how to install Bigtop on top of the Ubuntu Guest OS. From the Apache Bigtop site:

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc...) developed by a community with a focus on the system as a whole, rather than individual projects.

Open source software/frameworks work well individually, but it takes some effort/time to integrate them. The main challenge is the interoperability issues between the different frameworks. This is where companies like Cloudera, Hortonworks, MapR and others come into play. They take the different frameworks from the Apache Software Foundation and make sure they play nicely with each other. Not only do they address the interoperability issues, but they also make performance/usability enhancements.

Apache Bigtop takes this effort from an individual company level to a community level. Bigtop can be compared to Fedora, while Cloudera (CDH) / Hortonworks (HDP) / MapR (M3/M5/M7) can be compared to RHEL. Red Hat provides commercial support for RHEL, while Cloudera / Hortonworks / MapR provide commercial support for their own distributions of Hadoop. Also, Fedora has more leading edge software and a greater variety of it, and the same is the case with Bigtop. A lot of Apache frameworks (like Mahout / Hama) are included in Bigtop, but not in the commercial distributions like CDH/HDP/M*.

For those who want to dive deep into Big Data, Bigtop makes sense as it includes a lot of additional Big Data frameworks. Also, there are not many restrictions on its usage. More on What is Bigtop, and Why Should You Care?

Here is the official documentation on installing Bigtop. But, the documentation is a bit outdated and has some steps missing. Here are the steps in detail.

- Install Java as mentioned here. Make sure that Oracle JDK 6 is installed and not JDK 7, because Bigtop has been tested with JDK 6.