Tuesday, June 26, 2012

Beyond Hadoop WordCount


I often get the question `Now that I have run the wordcount example - what next? What else can be implemented on top of Hadoop?`. Here are some of the options to consider:

- Go through the code in the Hadoop examples package and understand in detail how MapReduce works.

- Implement some of the examples in `Data-Intensive Text Processing with MapReduce`.

- Pick a topic of interest from the blog entry from atbrox and start implementing it in MapReduce.
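
To get a feel for how the map, shuffle and reduce phases fit together before reading the Hadoop examples, here is a minimal in-memory sketch in plain Python. It builds a tiny inverted index (word to list of documents) instead of counting words. This is not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample documents are made up for illustration, and a real job would run the mappers and reducers on different machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Mapper: emit one (word, doc_id) pair for every word in a document."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, doc_ids):
    """Reducer: collapse the grouped values into a sorted, de-duplicated list."""
    return word, sorted(set(doc_ids))


def run_job(documents):
    """Run the three phases one after another, all in memory."""
    pairs = [
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    ]
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, ids) for word, ids in grouped.items())


# Hypothetical sample input: document id -> document text.
docs = {"d1": "hadoop runs mapreduce", "d2": "giraph runs on hadoop"}
index = run_job(docs)
```

Only the mapper and the reducer change from one problem to the next; the grouping step in between stays the same, which is roughly the part Hadoop takes care of.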

BTW, although the algorithms mentioned above can be implemented in MapReduce, it might not be the best model for them. Start considering alternative models like BSP (Bulk Synchronous Parallel). Take a look at the Apache Hama and Giraph frameworks for implementing some of the above-mentioned algorithms. Iterative algorithms like PageRank, in particular, can be implemented more efficiently on Hama and Giraph than on Hadoop.
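
To make the BSP idea concrete, here is a rough sketch of PageRank written as a sequence of supersteps in plain Python. It does not use the Hama or Giraph APIs; the function name `pagerank_bsp`, the sample graph and the default parameters are assumptions for illustration. It also assumes every vertex has at least one outgoing edge and that every neighbour appears as a key in the graph.

```python
def pagerank_bsp(graph, supersteps=30, damping=0.85):
    """PageRank as repeated supersteps over an adjacency-list graph.

    graph maps each vertex to the list of vertices it links to.
    Assumption: no vertex has an empty neighbour list.
    """
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}

    for _ in range(supersteps):
        # Compute + communicate: every vertex sends an equal share of its
        # current rank to each of its out-neighbours.
        inbox = {v: [] for v in graph}
        for vertex, neighbours in graph.items():
            share = rank[vertex] / len(neighbours)
            for target in neighbours:
                inbox[target].append(share)

        # Barrier: all messages are delivered before any vertex updates,
        # so every vertex sees only values from the previous superstep.
        rank = {
            v: (1 - damping) / n + damping * sum(inbox[v])
            for v in graph
        }

    return rank


# Hypothetical three-page graph: a links to b and c, b to c, c back to a.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_bsp(graph)
```

The whole computation stays in one loop with the ranks kept in memory between supersteps. A plain MapReduce version would typically run one job per iteration and pass the graph and the ranks through the file system each time, which is where much of the overhead comes from.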

Edit (27th August, 2012): This blog article has a nice summary of how to get started analyzing public data sets.
