Sunday, September 23, 2012

Big Data - Making it work @ HUG-Hyderabad

The HUG-Hyderabad meeting was held at the Infosys Hyderabad campus on 22nd September, 2012. It went well beyond my expectations, and I did some good networking with like-minded people. Here is the agenda for the meeting. Among the presentations, the data visualization talk from Gramener was really good and got me excited. Processing the data is important, but visualizing it is even more important. Here is the presentation by Naveen from Gramener.

Hadoop has become a buzzword, and there is an effort to fit every problem to a Hadoop solution. As you may have noticed from this blog, I am quite interested in alternative frameworks and models to Hadoop and MapReduce, such as Hama/Giraph and BSP. So, I talked about some of the strengths and weaknesses of Hadoop along with a few alternatives. The audience was really interested in it. Here is the presentation I used for my session.

There have been IT bubbles and booms in the past, and there will be more in the future. While some of them were geared towards decreasing costs and some towards making more profit, Big Data technology has started making a direct impact on our everyday lives, from health care to travel to education. There is hardly any aspect of life that doesn't or can't benefit from Big Data. Here is another presentation by Sudheer Marisetti from Abacus Concepts.

Enterprises are adopting open source clouds to avoid vendor lock-in, and service providers are using open source clouds because they can be customized to differentiate them from other service providers. Here is a nice article from GigaOM on `What role does open source play in cloud computing innovation?`. Ram Chinta from CloudStack shared the below `Community Connect : Apache CloudStack` presentation from the HUG session. Also, CloudStack meetups are being planned in Hyderabad. Those interested can join the CloudStack Hyderabad Group here.

The next HUG meeting will most likely be scheduled in a month or so in Hyderabad, and I am eagerly waiting for it. I encourage those who are interested in Big Data to register here for the upcoming event on Meetup and attend it. Also, if anyone is interested in hosting the event at their company in Hyderabad, please let me know at

Have a nice day!

Monday, September 10, 2012

Why does Hadoop use KV (Key/Value) pairs?

Hadoop implements the MapReduce paradigm in terms of key/value pairs, as described in the original Google MapReduce paper. The output types of the Map should match the input types of the Reduce, as shown below.

(K1, V1) -> Map -> (K2, V2)
(K2, list(V2)) -> Reduce -> (K3, V3)
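
The type flow above can be illustrated with a small in-memory word count. This is only a sketch in plain Python, not Hadoop code: the names `map_fn`, `reduce_fn` and `run_job` are made up for illustration, and the grouping step stands in for the shuffle/sort that the framework does between Map and Reduce.

```python
from collections import defaultdict


def map_fn(offset, line):
    # (K1, V1) = (line offset, line text) -> list of (K2, V2) = (word, 1)
    return [(word, 1) for word in line.split()]


def reduce_fn(word, counts):
    # (K2, list of V2) = (word, [1, 1, ...]) -> (K3, V3) = (word, total)
    return (word, sum(counts))


def run_job(records):
    # Stand-in for the shuffle phase: group all V2 values by their K2 key.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Call reduce once per key, in sorted key order.
    return [reduce_fn(k2, groups[k2]) for k2 in sorted(groups)]


# Hypothetical input: (offset, line) pairs.
records = [(0, "big data big"), (13, "data hadoop")]
result = run_job(records)
print(result)
```

Note that the key types (K2) and value types (V2) emitted by `map_fn` are exactly what `reduce_fn` receives, which is the constraint described above.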

The big question is why use key/value pairs?

MapReduce is derived from the concepts of functional programming. Here is a tutorial on the map/fold primitives in functional programming that are used in the MapReduce paradigm, and there is no mention of key/value pairs anywhere in it.
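
To make the point concrete, here is a minimal sketch of map and fold in plain Python (using the built-in `map` and `functools.reduce` as the fold). The function `square` and the sample list are just illustrative choices. Nothing in it involves keys or key/value pairs.

```python
from functools import reduce


def square(x):
    return x * x


numbers = [1, 2, 3, 4]

# map: apply a function to every element independently.
squares = list(map(square, numbers))

# fold: combine all elements into one value, starting from an initial accumulator.
total = reduce(lambda acc, x: acc + x, squares, 0)

print(squares, total)
```

The fold here combines everything into a single result. Keys only become necessary when the values have to be split into groups that are folded separately.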

According to the Google MapReduce paper:

We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately.

The only reason I can see for Hadoop using key/value pairs is that the Google paper used key/value pairs to meet Google's requirements, and the same model was implemented in Hadoop. Now everyone is trying to fit their problem space into key/value pairs.

Here is an interesting article on using tuples in MapReduce.