Thursday, May 22, 2014

Pig as a Service: Hadoop challenges data warehouses

Thanks to Gil Allouche (Qubole's VP of Marketing) for this post.

Hadoop and its ecosystem have evolved from a narrow map-reduce architecture into a universal data platform set to dominate the data processing landscape. Importantly, the push to simplify Hadoop deployments with managed cloud services, known as Hadoop-as-a-Service, is increasing Hadoop’s appeal for new data projects and architectures. Naturally, this development is permeating the Hadoop ecosystem, for example in the shape of Pig as a Service offerings.

Pig, developed by Yahoo Research in 2006, enables programmers to write data transformation programs for Hadoop quickly and easily, without the cost and complexity of map-reduce programs. Consequently, ETL (Extract, Transform, Load), the core workload of DWH (data warehouse) solutions, is often realized with Pig in the Hadoop environment. The business case for Hadoop and Pig as a Service is very compelling from both financial and technical perspectives.

Hadoop is becoming data’s Swiss Army knife
News about Hadoop last year was dominated by SQL (Structured Query Language) on Hadoop, with Hive, Presto, Impala, Drill, and countless other flavours competing to make big data accessible to business users. Most of these solutions are supported directly by Hadoop distributors, e.g. Hortonworks, MapR, Cloudera, and by cloud service providers, e.g. Amazon and Qubole.

The push for development in the area is driven by the vision for Hadoop to become the data platform of the future. The release of Hadoop 2.0 with YARN (Yet Another Resource Negotiator) last year was an important step. It turned the core of Hadoop’s processing architecture from a map-reduce centric solution into a generic cluster resource management tool able to run any kind of algorithm and application. Hadoop solution providers are now racing to capture the market for multipurpose, any-size data processing. SQL on Hadoop is only one of the stepping-stones to this goal.

Friday, May 16, 2014

User recommendations using Hadoop, Flume, HBase and Log4J - Part 2

Thanks to Srinivas Kummarapu for this post on how to show the appropriate recommendations to a web user based on the user activity in the past.

In the previous post we saw how to use Flume to move user activities into the Hadoop cluster. These activities can then be analyzed to figure out what a particular user is interested in.

For example, if a user wanted to buy a mobile phone from a shopping site and ended up buying none, all of that user's activities are now in the Hadoop cluster and can be analyzed to figure out what type of phone the user is interested in. Those phones can then be recommended when the user visits the site again.
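As a minimal sketch of that kind of analysis, assuming each tracked activity has been reduced to a (user, phone name) pair, the phones a user viewed most often can be ranked with a simple count. The record layout, the sample data and the function name `top_phones` are hypothetical and not taken from the original setup.

```python
from collections import Counter, defaultdict

# Hypothetical sample of tracked activities: (user id, phone name viewed).
activities = [
    ("user1", "PhoneA"),
    ("user1", "PhoneB"),
    ("user1", "PhoneA"),
    ("user2", "PhoneC"),
]

def top_phones(events, limit=2):
    """Return, per user, the phone names viewed most often (most viewed first)."""
    counts = defaultdict(Counter)
    for user, phone in events:
        counts[user][phone] += 1
    return {
        user: [phone for phone, _ in counter.most_common(limit)]
        for user, counter in counts.items()
    }

recommendations = top_phones(activities)
```

On a real cluster the same count would run as a distributed job over the stored activities; the in-memory version only illustrates the logic.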

The user activities in HBase contain only the mobile phone name and no further details. More details about each phone can be maintained in an RDBMS. We need to join the RDBMS data (phone details) with the HBase data and send the result to the Recommendations tables of the RDBMS in order to make recommendations to the user.

There are two options for performing the join:

1) Send the results from the Hadoop cluster to the RDBMS and do the join there.
2) Bring the RDBMS data into HBase and perform the join in a parallel, distributed fashion.
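Whichever side performs it, the join itself matches each activity row to the phone details by phone name. Here is a minimal in-memory sketch, assuming hypothetical field names and sample values (neither the HBase nor the RDBMS schema is given in this post):

```python
# Hypothetical activity rows as they might be read from HBase.
activity_rows = [
    {"user": "user1", "phone": "PhoneA"},
    {"user": "user2", "phone": "PhoneC"},
    {"user": "user3", "phone": "PhoneX"},  # no details available for this phone
]

# Hypothetical phone details as they might be stored in the RDBMS.
phone_details = [
    {"phone": "PhoneA", "brand": "BrandOne", "price": 199},
    {"phone": "PhoneC", "brand": "BrandTwo", "price": 349},
]

def join_on_phone(activities, details):
    """Inner join of activities with phone details on the phone name."""
    by_name = {row["phone"]: row for row in details}
    joined = []
    for activity in activities:
        detail = by_name.get(activity["phone"])
        if detail is not None:
            joined.append({**activity, **detail})
    return joined

recommendation_rows = join_on_phone(activity_rows, phone_details)
```

Building a lookup keyed by phone name keeps the join to a single pass over the activities; activities without matching details are dropped, as in an inner join.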

Both can be done with Sqoop (SQL to Hadoop), a tool that runs map-only jobs.
In this article we will see how to use Sqoop to import an RDBMS table into HBase incrementally.
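The idea behind an incremental import can be sketched without Sqoop: remember the highest value of a check column seen so far and, on the next run, copy only the rows above it. The code below is a plain-Python illustration of that idea, not Sqoop's implementation; the table contents, the `id` check column and the function name are hypothetical.

```python
def incremental_import(source_rows, target, last_value, check_column="id"):
    """Copy rows whose check column exceeds last_value; return the new high-water mark."""
    new_rows = [row for row in source_rows if row[check_column] > last_value]
    for row in new_rows:
        # Keyed by the check column, much like a row key in a key-value store.
        target[row[check_column]] = row
    if new_rows:
        last_value = max(row[check_column] for row in new_rows)
    return last_value

# Hypothetical RDBMS table of phone details.
table = [{"id": 1, "phone": "PhoneA"}, {"id": 2, "phone": "PhoneB"}]
store = {}

last_seen = incremental_import(table, store, last_value=0)  # first run copies both rows
table.append({"id": 3, "phone": "PhoneC"})
last_seen = incremental_import(table, store, last_seen)     # second run copies only id 3
```

Only the returned high-water mark has to be kept between runs, which is what makes repeated imports cheap.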

Friday, May 9, 2014

User recommendations using Hadoop, Flume, HBase and Log4J - Part 1

Thanks to Srinivas Kummarapu for this post on how to show the appropriate recommendations to a web user based on the user activity in the past.

This first of a four-part series assumes that Hadoop, Flume, HBase and Log4J have already been installed. In this article we will see how to track user activities and store them in HDFS and HBase. In future articles, we will look into some kind of basket analysis on the data in HDFS/HBase and project the results to the transaction database for recommendations. Also, refer to this article on using Flume to move data into HDFS.

Friday, May 2, 2014

Looking for guest bloggers at

The first entry was posted on this blog on 28th September, 2011. I started blogging as an experiment, but lately I have been having fun and enjoying it.

Not only has the traffic to the blog been increasing at a very good pace, but I have also been making quite a few acquaintances and getting a lot of interesting opportunities through the blog. I have received offers to write a book, to write an article, and to blog on other sites, among others.

I am looking for guest bloggers for this blog. If you or someone you know is interested, please let me know

a) a bit about yourself (along with LinkedIn profile)
b) topics you would like to write about on this blog
c) references to articles written in the past, if any

I don't want to put a lot of restrictions around this, but here are a few

a) the article should be authentic
b) no affiliate or promotional links to be included
c) the article can appear elsewhere after 10 days with a back link to the original

I am open to any topics around Big Data, but here are some of the topics I would be interested in

a) a use case on how your company/startup is using Big Data
b) using R/Python/Mahout/Weka for some interesting data processing
c) integrating different open source frameworks
d) comparing different open source frameworks with similar functionalities
e) ideas and implementations of pet projects or POCs (proofs of concept)
f) best practices and recommendations
g) views/opinions on different open source frameworks

As a bonus, if a post gets published here, it will also include a brief introduction to the author and a link to his/her LinkedIn profile. This will give the author some publicity.

If you are a rookie and writing for the first time, that shouldn't be a problem. Everything begins with a simple start. Please let me know at if you are interested in blogging here.