Sunday, October 28, 2012

Debugging a Hadoop MapReduce Program in Eclipse

Note: Also check out another entry on how to unit test MapReduce programs with MRUnit here. And here is a screencast for the same.

Distributed applications are by nature difficult to debug, and Hadoop is no exception. This blog entry explains how to set breakpoints and debug a user-defined Java MapReduce program in Eclipse.

Hadoop supports executing a MapReduce job in Standalone, Pseudo-Distributed and Fully-Distributed modes. As we move from one mode to another in that order, debugging becomes harder and new bugs are found along the way. Standalone mode, with the default Hadoop configuration properties, allows MapReduce programs to be debugged in Eclipse.

Step 1: Create a Java Project in Eclipse.

Step 2: For the Java project created in the earlier step, add the following dependencies (commons-configuration-1.6.jar, commons-httpclient-3.0.1.jar, commons-lang-2.4.jar, commons-logging-1.1.1.jar, commons-logging-api-1.0.4.jar, hadoop-core-1.0.3.jar, jackson-core-asl-1.8.8.jar, jackson-mapper-asl-1.8.8.jar and log4j-1.2.15.jar) in Eclipse. The dependencies are available by downloading and extracting a Hadoop release.

Step 3: Copy the driver, mapper and reducer classes to the src folder under the project. The Sample.txt file which contains the input data should be copied to the input folder. The project folder structure should look like below, without any compilation errors.

Step 4: Add the input and the output folder as the arguments to the program.

Step 5: Execute the program from Eclipse. There should be no exceptions/errors shown in the console, and on refreshing the project, an output folder should appear as shown below on successful completion of the MapReduce job. To rerun the program, the output folder has to be deleted first.

Step 6: As with any Java program, breakpoints can be set in the MapReduce driver, mapper and reducer code, and the program debugged.

In an upcoming blog, we will see how to pull the Hadoop code itself into Eclipse and compile/debug it along with the user-defined driver, mapper and reducer code.

Happy Hadooping !!!!

Note (5th March, 2013): The above instructions have been tried on Ubuntu 12.04, which has all the utilities like chmod and others that Hadoop uses internally. These tools are not available by default on Windows, and you might get errors as mentioned in this thread when trying the steps from this blog on a Windows machine.

One alternative is to install Cygwin on Windows as mentioned in this tutorial. This might or might not work smoothly.

Microsoft is working very aggressively to port Hadoop to the Windows platform and has recently released HDInsight. Check this and this for more details. This is the best bet for Windows fans: download the HDInsight Server on a Windows machine and try out Hadoop.

Sunday, October 7, 2012

Google driving the Big Data space

Google has unique requirements with respect to data processing and storage that almost no one else has. According to Wikipedia, Google processes about 24 petabytes of data per day; this figure may be a bit outdated, and Google might be processing even more per day by now. So, they need to continuously innovate to address these unique requirements. Soon they outgrow one innovation and come up with a new one.

The good thing is that Google has been continuously releasing these innovations as papers, once they are refined and there is a solid internal implementation behind them.

These Google papers have been implemented by the ASF (Apache Software Foundation) and others. It is taking some time for the ASF frameworks like Hadoop and others to become production ready, so there is a continuous game of catch-up between the Google papers and the ASF implementations.

Google Paper -> Apache Frameworks

The Google File System (October, 2003) -> HDFS (became an Apache TLP in 2008)

MapReduce: Simplified Data Processing on Large Clusters (December, 2004) -> MapReduce (became an Apache TLP in 2008)

Bigtable: A Distributed Storage System for Structured Data (November, 2006) -> HBase (became an Apache TLP in 2010), Cassandra (became an Apache TLP in 2010)

Large-scale graph computing at Google (June, 2009) -> Hama, Giraph (became an Apache TLP in 2012)

Dremel: Interactive Analysis of Web-Scale Datasets (2010) -> Apache Drill (incubating since August, 2012), Impala from Cloudera

Large-scale Incremental Processing Using Distributed Transactions and Notifications (2010) -> (no Apache counterpart yet)

Spanner: Google's Globally-Distributed Database (September, 2012) -> (no Apache counterpart yet)

Following the research/papers published by Google and the related blogs/articles gives an idea of where Big Data is moving. Many might not have the same requirements or resources as Google, so we will be seeing more and more cloud services for the same.

Edit (26th October, 2012) : Here is an article from Wired echoing what was mentioned in the blog. Very inspiring.

Edit (9th October, 2013) : Here is another perspective from Steve Loughran on thinking beyond what Google had been doing.

Friday, October 5, 2012

Some tips for sending emails through Amazon SES (Simple Email Service)

One of the requirements we had was sending thousands of email newsletters. Without a second thought, we decided to use Amazon SES for sending the emails, because as with any cloud service there is no initial cost and we pay on a need basis with no commitment. Also, the documentation around AWS (Amazon Web Services) is really good, and the AWS Eclipse Toolkit was useful for getting started.

Many of the email addresses (in the thousands) in our email database were invalid, which resulted in bounces or complaints. SES reports these invalid addresses through individual emails and/or through notifications via AWS SNS. Here is the flow for the same.

(Flow diagram courtesy of Amazon)

Initially we chose notification by email. At the end, we had thousands of emails back for both bounces and complaints, and had to extract the individual email addresses from those mails to clean our email database. This was a cumbersome task. So, we chose the alternate route of sending the bounced and complaint email addresses to SNS and then forwarding them to AWS SQS. A Java program using the AWS SDK for Java then pulled the messages from the queue.

Everything was nice and good till now. But when AmazonSQS.receiveMessage(ReceiveMessageRequest receiveMessageRequest) was called on the queue, only a single message was returned, in spite of there being around a thousand messages in the queue. Below is the probable reason for it, from the ReceiveMessage documentation.

>> Due to the distributed nature of the queue, a weighted random set of machines is sampled on a ReceiveMessage call. That means only the messages on the sampled machines are returned. If the number of messages in the queue is small (less than 1000), it is likely you will get fewer messages than you requested per ReceiveMessage call. If the number of messages in the queue is extremely small, you might not receive any messages in a particular ReceiveMessage response; in which case you should repeat the request.

To make life easier, one of the options was to invoke ReceiveMessageRequest.setMaxNumberOfMessages(Integer maxNumberOfMessages). But this too returns a maximum of 10 messages at a time, and an exception is thrown when the maximum number of messages is set to more than 10. So, we automated it further by repeatedly calling AmazonSQS.receiveMessage() till the queue was drained. The number of messages in the queue can be checked by calling the GetQueueAttributesResult.getAttributes() method, but this again does not return a consistent value, because of the distributed nature of the queue.
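The resulting drain loop can be sketched as below. This is not the AWS SDK: receive_message here is a hypothetical stand-in that mimics the SQS semantics described above (at most 10 messages per call, often fewer), just to make the loop-until-empty logic concrete.

```python
import random

def receive_message(queue, max_messages=10):
    # Stand-in for SQS ReceiveMessage: returns at most max_messages,
    # possibly fewer, mimicking the weighted random sampling behaviour.
    n = min(len(queue), random.randint(1, max_messages))
    return queue[:n], queue[n:]

def drain_queue(queue):
    # Keep polling until no messages remain, as described above.
    received = []
    while queue:
        batch, queue = receive_message(queue, max_messages=10)
        received.extend(batch)  # e.g. collect bounced addresses here
    return received

bounced = ["user%d@example.com" % i for i in range(1000)]
addresses = drain_queue(list(bounced))
```

With the real SDK, each iteration would also delete the received messages from the queue so they are not redelivered after the visibility timeout.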

Amazon Web Services is an awesome service, and I am really impressed by how quickly someone can get started, but there are small quirks here and there which need to be addressed. The purpose of this blog entry is to help those who are getting started with AWS for sending emails.

For those who are getting started with AWS, Amazon offers a free usage tier. The only caution is to stop the services when not needed, so that the usage doesn't cross the free limit.

Wednesday, October 3, 2012

Is Java a prerequisite for getting started with Hadoop?

The query I often get from those who want to get started with Hadoop is whether knowledge of Java is a prerequisite or not. The answer is both a yes and a no, depending on what the individual would like to do with Hadoop.

Why No?

MapReduce provides Map and Reduce primitives, which have existed as map and fold primitives in the functional programming world, in languages like Lisp, for quite some time. Hadoop provides Java interfaces to code against these primitives. But any language that can read from and write to STDIO, like Perl, Python, PHP and others, can also be used through Hadoop's Streaming feature.
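As an illustration of the Streaming model (a sketch, not code from this post), a word-count mapper and reducer in Python only need to read lines and emit tab-separated key/value pairs; the function names here are just for illustration.

```python
from itertools import groupby

def mapper(lines):
    # Map phase: emit one "word<TAB>1" pair per word,
    # the format a Streaming mapper writes to STDOUT.
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word

def reducer(pairs):
    # Hadoop sorts the mapper output by key before the reduce phase,
    # so all pairs for a given word arrive together.
    keyed = (p.split("\t") for p in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

# Simulating the map -> sort/shuffle -> reduce flow locally:
counts = list(reducer(sorted(mapper(["the quick fox", "the fox"]))))
```

In an actual job, each phase would be its own executable script reading sys.stdin and printing to stdout, wired together with the hadoop-streaming jar's -mapper and -reducer options.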

Also, there are high-level abstractions provided by Apache frameworks like Pig and Hive, for which familiarity with Java is not required. Pig is programmed in Pig Latin and Hive using HiveQL; both are automatically converted into MapReduce programs in Java.

According to the Cloudera blog entry `What Do Real-Life Hadoop Workloads Look Like?` - Pig and Hive constitute a majority of the workloads in a Hadoop cluster. Below is the histogram from the mentioned Cloudera blog.

Why Yes?

Hadoop and its ecosystem can be easily extended with additional functionality, like custom InputFormats and OutputFormats, UDFs (User Defined Functions) and others. For customizing Hadoop this way, knowledge of Java is mandatory.

Also, it is often necessary to dig deep into the Hadoop code to see why something behaves a particular way or to learn more about the functionality of a particular module. Again, knowledge of Java comes in handy here.

Hadoop projects come with a lot of different roles, like Architect, Developer, Tester and Linux/Network/Hardware Administrator, some of which require explicit knowledge of Java and some of which don't. My suggestion is: if you are genuinely interested in Big Data and think that Big Data will make a difference, then dive deep into Big Data technologies irrespective of your knowledge of Java.

Happy Hadooping !!!!