Note: Also don't forget to check out another entry on how to unit test MR programs with MRUnit here. And here is a screencast for the same.
Distributed applications are by nature difficult to debug, and Hadoop is no exception. This blog entry will explain how to set breakpoints and debug a user-defined Java MapReduce program in Eclipse.
Hadoop supports executing a MapReduce job in Standalone, Pseudo-Distributed and Fully-Distributed mode. As we move from one mode to another in that order, debugging becomes harder and new bugs are found along the way. Standalone mode, with the default Hadoop configuration properties, allows MapReduce programs to be debugged in Eclipse.
Step 1: Create a Java Project in Eclipse.
Step 2: For the Java project created in the earlier step, add the following dependencies (commons-configuration-1.6.jar, commons-httpclient-3.0.1.jar, commons-lang-2.4.jar, commons-logging-1.1.1.jar, commons-logging-api-1.0.4.jar, hadoop-core-1.0.3.jar, jackson-core-asl-1.8.8.jar, jackson-mapper-asl-1.8.8.jar and log4j-1.2.15.jar) in Eclipse. The dependencies are available by downloading and extracting a Hadoop release.
Step 3: Copy MaxTemperature.java, MaxTemperatureMapper.java, MaxTemperatureReducer.java, MaxTemperatureWithCombiner.java and NewMaxTemperature.java to the src folder under the project. The Sample.txt file, which contains the input data, should be copied to the input folder. The project folder structure should look like below, without any compilation errors.
Step 4: Add the input and the output folder as the arguments to the MaxTemperature.java program.
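In Eclipse, program arguments are set under Run -> Run Configurations -> Arguments. Assuming the input folder created in Step 3 and an output folder named output (both paths relative to the project root; the exact names are an assumption), the Program arguments box would contain just the two paths separated by a space:

```
input output
```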
Step 5: Execute MaxTemperature.java from Eclipse. There should be no exceptions/errors shown in the console, and on refreshing the project, an output folder should appear as shown below on successful completion of the MapReduce job. To rerun the program, the output folder has to be deleted first.
Step 6: As with any Java program, breakpoints can be put in the MapReduce driver, mapper and reducer code and debugged.
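The output folder can also be removed from a terminal instead of through Eclipse; the output path here is assumed to be relative to the project root, matching the arguments from Step 4:

```shell
# Hadoop refuses to start a job whose output directory already exists,
# so remove the previous run's results before resubmitting.
rm -rf output
```

Refresh the project in Eclipse afterwards so the Package Explorer picks up the change.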
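Because Standalone mode runs the whole job in a single JVM, a breakpoint placed inside MaxTemperatureMapper is hit directly, with no remote-debug setup. The parsing at the heart of that mapper can even be stepped through without Hadoop on the classpath at all. The sketch below is a plain-Java extraction of that logic; the fixed-width column offsets (year at 15-19, signed temperature in tenths of a degree at 87-92, quality code at 92) are assumptions based on the NCDC weather record layout the sample data follows.

```java
// A plain-Java sketch of the parsing logic inside MaxTemperatureMapper,
// extracted so it can be run and debugged with breakpoints on its own.
// The column offsets are assumptions based on the NCDC record layout.
public class MaxTemperatureParser {

    static final int MISSING = 9999; // sentinel used for "no reading"

    static String year(String record) {
        return record.substring(15, 19);
    }

    static int airTemperature(String record) {
        // Strip a leading '+' before parsing the signed temperature.
        if (record.charAt(87) == '+') {
            return Integer.parseInt(record.substring(88, 92));
        }
        return Integer.parseInt(record.substring(87, 92));
    }

    static boolean isValid(int temperature, String record) {
        // Quality codes 0, 1, 4, 5 and 9 mark a trustworthy reading.
        return temperature != MISSING && record.substring(92, 93).matches("[01459]");
    }

    public static void main(String[] args) {
        // Build a synthetic record: year 1950, temperature 2.2 degrees C.
        StringBuilder sb = new StringBuilder("0".repeat(93));
        sb.replace(15, 19, "1950");
        sb.replace(87, 92, "+0022");
        sb.setCharAt(92, '1');
        String record = sb.toString();
        System.out.println(year(record) + "\t" + airTemperature(record));
    }
}
```

A breakpoint set inside airTemperature here behaves the same as one set inside the real mapper's map method when the job runs in Standalone mode.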
In an upcoming blog, we will see how to include/compile/debug the Hadoop code itself in Eclipse along with the user-defined driver, mapper and reducer code.
Happy Hadooping !!!!
Note (5th March, 2013): The above instructions have been tried on Ubuntu 12.04, which has all the utilities like chmod that Hadoop uses internally. These tools are not available by default on Windows, and you might get an error, as mentioned in this thread, when trying the steps in this blog on a Windows machine.
One alternative is to install Cygwin on Windows, as mentioned in this tutorial. This might or might not work smoothly.
Microsoft is working very aggressively to port Hadoop to the Windows platform and has recently released HDInsight. Check this and this for more details. This is the best bet for all the Windows fans. Download the HDInsight Server on a Windows machine and try out Hadoop.
Thanks for the tutorial, simple and accurate.
I think the source code files you're pointing to are missing,
but anyway pretty helpful stuff.
Thanks for the tip - fixed it. I am always in a hurry to delete files :)
How to write the output dir onto an HDFS server using the above process? What is the code?
The output has to be written to HDFS, right? But I can see the data locally. How?
Is there a way to switch between the standalone and pseudo-distributed modes?
I have tried Eclipse with standalone mode and not with the other modes.
Praveen
I did follow the steps on Ubuntu 12.04, which has the utilities like chmod, while those utilities are not there on Windows. So, Cygwin has to be installed on Windows. Check the below thread from the Hadoop forums:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201105.mbox/%3CBANLkTin-8+z8uYBTdmaa4cvxz4JzM14VfA@mail.gmail.com%3E
Thanks for the tip, I will update the blog accordingly.
Here is a tutorial on installing Cygwin/Hadoop on Windows
http://v-lad.org/Tutorials/Hadoop/03%20-%20Prerequistes.html
But, I wouldn't bet on this approach since Microsoft is aggressively working to make Hadoop work on Windows.
Thanks for the reply.
I just removed my employee ID from the log and am posting the question here again to help others know what the question is.
Pardon me if it is getting confusing.
13/03/06 19:37:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/03/06 19:37:23 ERROR security.UserGroupInformation: PriviledgedActionException as:cause:java.io.IOException: Failed to set permissions of path: \tmp\hadoop-\mapred\staging\.staging to 0700
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-madhu\mapred\staging\275\.staging to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
at maxtemp.MaxTemperature.main(MaxTemperature.java:36)
Waiting for your tip to run Hadoop MapReduce on Windows + Eclipse.
Thanks in advance.
Thank You for your posts. I am doing my Master Thesis now and your web entries are very helpful.
@Madhu Reddy
Why drill someone's head about Windows? Better to try running it on Linux. Try to prepare a ready-made virtual Linux image with everything needed pre-installed, or find one (VirtualBox). It will save your time, and you could use this experience in the future when you run Hadoop for some real work.
Also, Microsoft recently announced HDInsight (http://goo.gl/1LoHk) for running Hadoop on Windows Server and Azure. So, Hadoop will be natively supported on Windows by Microsoft.
But I am not sure how many instances of Hadoop on Windows we will see. Hadoop clusters run into 10s/100s/1000s of nodes, and a Windows license has to be bought for all the machines.
I followed your tutorial, I thought to the letter. I am running into some pretty heavy errors - below. Any thoughts?
13/04/21 22:09:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
13/04/21 22:09:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/04/21 22:09:27 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
13/04/21 22:09:27 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-hduser/mapred/staging/hduser1546474700/.staging/job_local_0001
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file://home/hduser/workspace/output, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:294)
at org.apache.hadoop.fs.FilterFileSystem.makeQualified(FilterFileSystem.java:85)
at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:112)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:889)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261)
at MaxTemperature.main(MaxTemperature.java:33)
I apologize for dirtying the nice example with a mis-formatted XML post. What step do I need to perform in order to add the pom.xml? I am a newbie with this blog site.
Here is a blog entry that has the pom.
http://scottcote.wordpress.com/2013/06/03/maven-configuration-for-eclipse-hadoop-project/
I have modified org.apache.mahout.classifier.sgd.TrainLogistics.java to read data from HDFS. Now I want to use the jar file generated from the code to run in Cygwin and perform Logistic regression.
I wanted to ask if I have to compile the java file in Cygwin itself and then make a runnable jar file through commands or can I directly run the jar file itself? If it is possible to run the jar directly, how can it be done?
I am confused about giving the input and output arguments. Please help me.
ReplyDeleteError,
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
I am new to hadoop. I need to debug sample mapreduce code using eclipse.
For this I have edited hadoop-env.sh with the following line:
HADOOP_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=50002"
and in my Eclipse Remote Debug configuration wizard, I gave
host:10.0.2.23
port:50002
But while running my code in debug mode, I am getting the following error
java.net.ConnectException: connection refused
Please help me.
Hi Praveen,
Very helpful guide, thanks.
Please let me know how I can use JUnit to start developing a MapReduce application. I hope you are using it for developing applications. The Definitive Guide confuses a lot in developing and debugging.
If possible, kindly give some leads on where I can find guidance on using JUnit etc. for MapReduce.
Sushant
I have written on using MRUnit at http://www.thecloudavenue.com/2013/12/UnitTestingMapReduceWithMRUnitFramework.html.
Hi Praveen,
I was able to run this, but got the wrong result. Any idea?
1907 -128
1914 0
1990 11
The data in Sample.txt and the Eclipse screen shot were not in sync. I have updated the Sample.txt. Download the Sample.txt again and replace it in the project.
I have the same problem as Pratik.
The output is
1907 -128
1914 0
1990 11
The data in Sample.txt and the Eclipse screen shot were not in sync. I have updated the Sample.txt. Download the Sample.txt again and replace it in the project.
Hello,
Can you please provide a tutorial for running the Mahout map-reduce code written in Eclipse on an Amazon EC2 instance?
Thanks...
Here
http://www.thecloudavenue.com/2013/12/MRExecutionWithAmazonEMR.html
http://www.thecloudavenue.com/2013/12/CustomMRExecutionWithAmazonEMR.html
Dear Praveen, your blog is very useful. Thank you.
I followed your way and my output is
1907 -128
1914 0
1990 11
It looks wrong. Do you have any idea about the problem?
The data in Sample.txt and the Eclipse screen shot were not in sync. I have updated the Sample.txt. Download the Sample.txt again and replace it in the project.
Looks like you are using YARN (RM, NM etc.) and not the legacy MR architecture (JT, TT etc.). The steps are for the legacy MR and not for YARN.
I am having the same issue with java.io.IOException: Cannot initialize Cluster.
I have searched for information on how to switch back from YARN/MR2 to legacy MR1. Unfortunately I am not having any luck.
Any hints or recommendations would be greatly appreciated.
I wish to ask if there is a way to see the job log files produced by Eclipse, as they don't appear in the logs folder of Hadoop.
Dear Sir, I am using Hadoop 1.0.4 with Cygwin, installed on my Windows 7 machine.
I have properly configured core-site.xml, hdfs-site.xml and mapred-site.xml, and I always do ssh localhost in the Cygwin terminal, then go to the Hadoop bin directory and run ./start-all.sh to start all the services.
When I run MapReduce on the WordCount example with Eclipse on Windows, I get these errors:
14/01/14 19:14:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/01/14 19:14:19 ERROR security.UserGroupInformation: PriviledgedActionException as:Nitesh cause:java.io.IOException: Failed to set permissions of path: \tmp\hadoop-sshd\mapred\staging\Nitesh874657781\.staging to 0700
Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-sshd\mapred\staging\Nitesh874657781\.staging to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
at word.WordCount.main(WordCount.java:37)
To solve this error I have to do the following, but I am not able to:
1. Under your Hadoop directory, go to src -> core -> org -> apache -> hadoop -> fs
2. Open FileUtil.java
3. Comment out the code inside the checkReturnValue function at line 685.
4. Recreate hadoop-core-1.0.4.jar and include it in the build path of your Eclipse project.
So can you do this process and provide me a link to download the hadoop-core-1.0.4.jar file, please sir, to my mail ID niteshkumardash@gmail.com, so that I can go ahead on my project?
Hi Praveen,
It would be helpful if you could help me in the below scenario.
We have a 4-node Cloudera cluster set up at location w.x.y.z, and I want to log in there and run my MapReduce program from my machine, but the Eclipse setup is not allowing me to log in there.
Please explain how we can achieve this.
Hi Praveen,
I have tried importing the Hadoop code in Eclipse. I would like to perform some modifications to my map reduce framework, for which I need to make some modifications to the Hadoop code. In order to figure out where the modifications are to be made, I need to debug it from Eclipse. I could import the Hadoop project into Eclipse successfully. What's the next step to be done? How do I run Hadoop from Eclipse? Please do help.
Hi Praveen,
I am using hadoop-2.4.0 and followed your tutorial. I have the following dissimilarities:
1: Warning: the Job constructor is deprecated.
2: In Run Configuration -> Java Application I see NewMaxTemperature instead of MaxTemperature.
3: I get the following error of the Job class not being found:
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory
at org.apache.hadoop.mapreduce.Job.(Job.java:78)
at NewMaxTemperature.main(NewMaxTemperature.java:58)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 2 more
Thanks a lot for sharing very useful posts; they helped me a lot. Please keep posting useful posts related to the Hadoop ecosystem. Thanks once again :)