Sunday, October 28, 2012

Debugging a Hadoop MapReduce Program in Eclipse

Note: Also don't forget to check another entry on how to unit test MR programs with MRUnit here. And here is a screencast for the same.

Distributed applications are by nature difficult to debug, and Hadoop is no exception. This blog entry explains how to set breakpoints and debug a user-defined Java MapReduce program in Eclipse.

Hadoop supports executing a MapReduce job in Standalone, Pseudo-Distributed and Fully-Distributed mode. As we move from one mode to the next in that order, debugging becomes harder and new bugs are found on the way. Standalone mode with the default Hadoop configuration properties allows MapReduce programs to be debugged in Eclipse.


Step 1: Create a Java Project in Eclipse.



Step 2: For the Java project created in the earlier step, add the following dependencies (commons-configuration-1.6.jar, commons-httpclient-3.0.1.jar, commons-lang-2.4.jar, commons-logging-1.1.1.jar, commons-logging-api-1.0.4.jar, hadoop-core-1.0.3.jar, jackson-core-asl-1.8.8.jar, jackson-mapper-asl-1.8.8.jar and log4j-1.2.15.jar) in Eclipse. The dependencies are available by downloading and extracting a Hadoop release.
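A quick way to confirm that the JARs actually made it onto the build path is to try loading one class from each of them. The following is only a minimal sketch: the class name ClasspathCheck is made up for illustration, and the three class names it probes are assumptions about what lives in hadoop-core, commons-logging and log4j. Adjust the list to the JARs in your project.

```java
/** Reports which of a few expected classes can be loaded from the current classpath. */
final class ClasspathCheck {

    /** Returns true if the named class can be located on the classpath. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed representative classes, one per JAR; edit as needed.
        String[] expected = {
            "org.apache.hadoop.mapred.JobClient",     // hadoop-core
            "org.apache.commons.logging.LogFactory",  // commons-logging
            "org.apache.log4j.Logger"                 // log4j
        };
        for (String name : expected) {
            System.out.println((isLoadable(name) ? "found    " : "MISSING  ") + name);
        }
    }
}
```

Run it as a Java application from the same Eclipse project. A MISSING line usually means the corresponding JAR was not added to the build path, which would otherwise show up later as a NoClassDefFoundError when the job starts.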


Step 3: Copy MaxTemperature.java, MaxTemperatureMapper.java, MaxTemperatureReducer.java, MaxTemperatureWithCombiner.java and NewMaxTemperature.java to the src folder under the project. The Sample.txt file, which contains the input data, should be copied to the input folder. The project folder structure should look like the one below, without any compilation errors.


Step 4: Add the input and the output folder as the arguments to the MaxTemperature.java program.
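Plain folder names such as input and output are the simplest arguments. If you pass file: URIs instead, watch the number of slashes: one of the comments below reports a "Wrong FS: file://home/hduser/workspace/output, expected: file:///" error, which looks like this pitfall. The sketch below uses only java.net.URI from the JDK (no Hadoop classes) to show how the two forms are parsed. The class name PathArgCheck and the sample paths are illustrative only.

```java
import java.net.URI;

/** Shows how a path argument is split into authority and path when parsed as a URI. */
final class PathArgCheck {

    /** Returns the authority (host) part of the argument, or null if there is none. */
    static String authorityOf(String arg) {
        return URI.create(arg).getAuthority();
    }

    /** Returns the path part of the argument. */
    static String pathOf(String arg) {
        return URI.create(arg).getPath();
    }

    public static void main(String[] args) {
        String[] samples = {
            "output",                                // relative folder name
            "file:///home/hduser/workspace/output",  // three slashes: empty authority
            "file://home/hduser/workspace/output"    // two slashes: "home" becomes the authority
        };
        for (String s : samples) {
            System.out.println(s + " -> authority=" + authorityOf(s) + ", path=" + pathOf(s));
        }
    }
}
```

With two slashes, the first path segment is read as a host name and the remaining path no longer starts at the directory you intended, so a check against the local file system is likely to reject it.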


Step 5: Execute MaxTemperature.java from Eclipse. There should be no exceptions/errors shown in the console. On refreshing the project, an output folder should appear as shown below on successful completion of the MapReduce job. To rerun the program, the output folder has to be deleted first.
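Deleting the output folder by hand before every debug run gets tedious. One option is a small helper that removes it, run before the job or called at the top of the driver's main method. This is a sketch using only the JDK (java.nio.file, Java 8 or later), not anything provided by Hadoop. The class name OutputCleaner and the default folder name "output" are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.stream.Stream;

/** Removes a local output folder so that a MapReduce job can be rerun. */
final class OutputCleaner {

    /** Deletes dir and everything under it; does nothing if dir does not exist. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order visits children before their parent directory.
            Iterable<Path> deepestFirst = walk.sorted(Comparator.reverseOrder())::iterator;
            for (Path p : deepestFirst) {
                Files.delete(p);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Folder name is taken from the first argument, defaulting to "output".
        Path output = Paths.get(args.length > 0 ? args[0] : "output");
        deleteRecursively(output);
        System.out.println("Removed (if present): " + output.toAbsolutePath());
    }
}
```

Be careful with the argument: the helper deletes whatever folder it is pointed at, without asking.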


Step 6: As with any Java program, breakpoints can be set in the MapReduce driver, mapper and reducer code, and the program can then be debugged.


In an upcoming blog entry, we will see how to include/compile/debug the Hadoop code in Eclipse along with the user-defined driver, mapper and reducer code.

Happy Hadooping !!!!

Note (5th March, 2013): The above instructions have been tried on Ubuntu 12.04, which has utilities like chmod and others that Hadoop uses internally. These tools are not available by default on Windows, and you might get an error as mentioned in this thread when trying the steps in this blog on a Windows machine.

One alternative is to install Cygwin on Windows as mentioned in this tutorial. This might or might not work smoothly.

Microsoft is working very aggressively to port Hadoop to the Windows platform and has released HDInsight recently. Check this and this for more details. This is the best bet for all the Windows fans. Download the HDInsight Server on a Windows machine and try out Hadoop.

40 comments:

  1. Thanks for the tutorial, simple and accurate.
    I think the source code files you' re pointing to are missing,
    but anyway pretty helpful stuff.

    Replies
    1. Thanks for the tip. Fixed it. I am always in a hurry to delete files :)

  2. How to write output dir onto HDFS server .. using above process what is code. ?

    Replies
    1. Output has to be written in HDFS right ? but I can see the data in local, how ?

  3. is there a way to switch between the standalone and pseudo-distributed modes?

    Replies
    1. I have tried Eclipse with standalone mode and not with other modes.

      Praveen

  4. This comment has been removed by the author.

    Replies
    1. I did follow the steps in Ubuntu 12.04 which has the utilities like chmod and others, while the utilities are not there in Windows. So, cygwin has to be installed on Windows. Check the below thread from the Hadoop forums

      http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201105.mbox/%3CBANLkTin-8+z8uYBTdmaa4cvxz4JzM14VfA@mail.gmail.com%3E

      Thanks for the tip, I will update the blog accordingly.

    2. Here is a tutorial on installing Cygwin/Hadoop on Windows

      http://v-lad.org/Tutorials/Hadoop/03%20-%20Prerequistes.html

      But, I wouldn't bet on this approach since Microsoft is aggressively working to make Hadoop work on Windows.

    3. Thanks for the reply......
      I just removed my emp id from the log and am posting the question here again so that others know what the question is.

      Pardon me if this is confusing.
      13/03/06 19:37:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      13/03/06 19:37:23 ERROR security.UserGroupInformation: PriviledgedActionException as:cause:java.io.IOException: Failed to set permissions of path: \tmp\hadoop-\mapred\staging\.staging to 0700
      Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-madhu\mapred\staging\275\.staging to 0700
      at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
      at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
      at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
      at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
      at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
      at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
      at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:396)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
      at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
      at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
      at maxtemp.MaxTemperature.main(MaxTemperature.java:36)

    4. waiting for your tip to run hadoop map reduce on windows + eclipse...?

      Thanks in advance...

  5. Thank You for your posts. I am doing my Master Thesis now and your web entries are very helpful.

    @Madhu Reddy
    Why to drill someone's head about windows? Better try to run it on linux. Try to prepare ready-made virtual linux image with pre-installed everything what's needed or try to find one (Virtual Box). It will save your time and you could use this experience in the future when you'll run hadoop for some real work.

    Replies
    1. Also, Microsoft recently announced HDInsights (http://goo.gl/1LoHk) for running Hadoop on Windows Server and Azure. So, Hadoop will be natively supported on Windows by Microsoft

      But, not sure how many instances we will see of Hadoop/Windows. Hadoop clusters run into 10s/100s/1000s of nodes and a Windows license has to be bought for all the machines.

  6. I followed your tutorial, I thought to the letter. I am running into some pretty heavy errors - below. Any thoughts?


    13/04/21 22:09:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    13/04/21 22:09:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
    13/04/21 22:09:27 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
    13/04/21 22:09:27 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-hduser/mapred/staging/hduser1546474700/.staging/job_local_0001
    Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file://home/hduser/workspace/output, expected: file:///
    at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
    at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:294)
    at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:85)
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:112)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
    at MaxTemperature.main(MaxTemperature.java:33)

  7. I apologize for dirtying the nice example with a mis-formated xml post. What step do I need to perform in order to add the pom.xml? am a newbie with this blog site.

  8. Here is a blog entry that has the pom.

    http://scottcote.wordpress.com/2013/06/03/maven-configuration-for-eclipse-hadoop-project/

  9. I have modified org.apache.mahout.classifier.sgd.TrainLogistics.java to read data from HDFS. Now I want to use the jar file generated from the code to run in Cygwin and perform Logistic regression.

    I wanted to ask if I have to compile the java file in Cygwin itself and then make a runnable jar file through commands or can I directly run the jar file itself? If it is possible to run the jar directly, how can it be done?

  10. I am getting confused while giving the input and output arguments... please help me

  11. Error,
    Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

  12. I am new to hadoop. I need to debug sample mapreduce code using eclipse.
    For this I have edited hadoop-env.sh with the following line
    HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=50002"

    and in my eclipse Remote Debug configuration wizard,I was given
    host:10.0.2.23
    port:50002

    But while running my code in debug mode, I am getting the following error

    java.net.ConnectException: connection refused

    please help me

  13. Hi Praveen,
    Very helpful guide, thanks.
    Pl let me know how can use Junit to start developing MapReduce application. Hope you must be using it for developing applications. Definitive guide confuses a lot in developing and debugging.

    If possible kindly give some leads on where from I can avail myself some guidance in using Junit etc for mapreduce.

    Sushant

    Replies
    1. I have written on using MRUnit at http://www.thecloudavenue.com/2013/12/UnitTestingMapReduceWithMRUnitFramework.html.

  14. hi praveen
    I was able to run this but the wrong result any idea?

    1907 -128
    1914 0
    1990 11

    Replies
    1. The data in Sample.txt and the Eclipse screen shot were not in sync. I have updated the Sample.txt. Download the Sample.txt again and replace it in the project.

  15. I have the same problem like pratik
    the output is
    1907 -128
    1914 0
    1990 11

    Replies
    1. The data in Sample.txt and the Eclipse screen shot were not in sync. I have updated the Sample.txt. Download the Sample.txt again and replace it in the project.

  16. Hello,
    Can you please provide the tutorial for running the mahout map-reduce code written in eclipse to run on Amazon EC2 instance?

    Thanks...

    Replies
    1. Here

      http://www.thecloudavenue.com/2013/12/MRExecutionWithAmazonEMR.html
      http://www.thecloudavenue.com/2013/12/CustomMRExecutionWithAmazonEMR.html

  17. Dear Praveen, your blog is very useful. thank you.
    I follow your way and my output is
    1907 -128
    1914 0
    1990 11
    it looks wrong... Do you have a any idea about the problem?


    Replies
    1. The data in Sample.txt and the Eclipse screen shot were not in sync. I have updated the Sample.txt. Download the Sample.txt again and replace it in the project.

    2. Looks like you are using YARN (RM, NM etc) and not the legacy MR Architecture (JT, TT etc). The steps are for the legacy MR and not for YARN.

    3. I am having the same issue with java.io.IOException: Cannot initialize Cluster
      I have searched for information on how to switch back from YARN/MR2 to legacy MR1. Unfortunately I am not having any luck.
      Any hints or recommendations would be greatly appreciated.

  18. i wish to ask if there is a way to see the job log files produced by eclipse as it doesn't appear in logs folder of hadoop

  19. Dear Sir, I am using Hadoop 1.0.4 with Cygwin installed on my Windows 7 machine.
    I have properly configured core-site.xml, hdfs-site.xml and mapred-site.xml. I always do ssh localhost on the Cygwin terminal, go to the hadoop bin directory and run ./start-all.sh to start all services.
    When I run the MapReduce wordcount example with Eclipse on Windows, I get these errors

    14/01/14 19:14:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    14/01/14 19:14:19 ERROR security.UserGroupInformation: PriviledgedActionException as:Nitesh cause:java.io.IOException: Failed to set permissions of path: \tmp\hadoop-sshd\mapred\staging\Nitesh874657781\.staging to 0700
    Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-sshd\mapred\staging\Nitesh874657781\.staging to 0700
    at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
    at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
    at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
    at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
    at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Unknown Source)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
    at word.WordCount.main(WordCount.java:37)

    To solve this error I have to do the following, but I am not able to.
    1.Under your hadoop directory go to src->core->org->apache->hadoop->fs
    2.Open FileUtil.Java
    3.Comment out the code inside checkReturnValue function at line 685.
    4.Re create hadoop-core-1.0.4.jar and include it in the build path of your eclipse project

    so can u do this process and provide me a link to download the hadoop-core-1.0.4.jar file please sir.to my mailid:niteshkumardash@gmail.com so that i can go ahead on myproject

  20. Hi Praveen

    It would be helpful if you can help me in the below scenarios.
    we have a 4 node Cloudera cluster setup on w.x.y.z location and I want to login there and run my mapreduce program from my machine but the eclipse setup is not allowing me login there.
    Please explain how can we achieve this.

  21. This comment has been removed by the author.

  22. This comment has been removed by the author.

  23. Hi Praveen,
    I have tried importing the Hadoop code in Eclipse. I would like to perform some modifications to my map reduce framework, for which i need to make some modifications to the Hadoop code. In order to figure out where the modifications are to be made, I need to debug it from Eclipse. I could import Hadoop project into Eclipse successfully. What's the next step to be done?How t run Hadoop from Eclipse?Please do help.

  24. Hi Praveen,
    I am using hadoop-2.4.0 and followed your tutorial. I have following dissimilarities:
    1: warning: job constructor is deprecated.
    2:In run configuration-> java application i see newmaxtemperature instead of maxtemperature.
    3: I get the following error of Job class not found:
    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
    at org.apache.hadoop.mapreduce.Job.<init>(Job.java:78)
    at NewMaxTemperature.main(NewMaxTemperature.java:58)
    Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    ... 2 more

  25. Thanks a lot for sharing very useful posts and it helped me a lot. please keep post the useful posts related to Hadoop Eco System.. Thanks once again.. :)
