Friday, January 6, 2012

Getting started with NextGen MapReduce (single node) in easy steps

Without a detailed explanation of what-is-what, which is due for another blog entry, here are simple steps to get started with MRv2 (next generation MapReduce) on a single node. Find more details about MRv2 here. So, here are the steps:

1) Download the Hadoop 2.x release here.

2) Extract it to a folder (let's call it $HADOOP_HOME).

3) Add the following to .bashrc in the home folder.
export HADOOP_HOME=/home/vm4learning/Installations/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
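After saving .bashrc, reload it so that the variables take effect in the current shell and verify the value (assuming the installation path above matches your setup):
source ~/.bashrc
echo $HADOOP_HOME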
4) Create the namenode and datanode folders under the $HADOOP_HOME folder.
mkdir -p $HADOOP_HOME/yarn/yarn_data/hdfs/namenode
mkdir -p $HADOOP_HOME/yarn/yarn_data/hdfs/datanode
5) Create the following configuration files in the $HADOOP_HOME/etc/hadoop folder; each snippet goes inside the <configuration> element of its file (see the complete example after the snippets).
etc/hadoop/yarn-site.xml
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
   <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
   </property>
etc/hadoop/core-site.xml
   <property>
       <name>fs.default.name</name>
       <value>hdfs://localhost:9000</value>
   </property>
etc/hadoop/hdfs-site.xml
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>file:/home/vm4learning/Installations/hadoop-2.2.0/yarn/yarn_data/hdfs/namenode</value>
   </property>
   <property>
      <name>dfs.datanode.data.dir</name>
      <value>file:/home/vm4learning/Installations/hadoop-2.2.0/yarn/yarn_data/hdfs/datanode</value>
   </property>
etc/hadoop/mapred-site.xml
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
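As a sketch of what a complete file looks like with the <configuration> wrapper, here is mapred-site.xml (create it from mapred-site.xml.template if it is not already present); the other files follow the same pattern:
<?xml version="1.0"?>
<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>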
6) Format the NameNode
bin/hadoop namenode -format
7) Start the Hadoop daemons
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemon.sh start datanode
sbin/hadoop-daemon.sh start secondarynamenode
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemon.sh start nodemanager
sbin/mr-jobhistory-daemon.sh start historyserver
8) Time to check whether the installation was successful.

    a) Check the log files in the $HADOOP_HOME/logs folder for any errors.

    b) The following web consoles should come up:
http://localhost:50070/ for NameNode
http://localhost:8088/cluster for ResourceManager
http://localhost:19888/jobhistory for Job History Server
    c) Run the jps command to make sure that the daemons are running (a quick check script follows the sample output below).
2234 Jps
1989 ResourceManager
2023 NodeManager
1856 DataNode
2060 JobHistoryServer
1793 NameNode
2049 SecondaryNameNode
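As a convenience sketch (not part of the official tooling), a small shell loop can confirm that each of the expected daemons shows up in the jps output:
for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager JobHistoryServer; do
   jps | grep -qw "$daemon" && echo "$daemon is running" || echo "$daemon is NOT running"
done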
9) Create a file and copy it to HDFS
mkdir in

vi in/file
Hadoop is fast
Hadoop is cool

bin/hadoop dfs -copyFromLocal in/ /in
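To confirm the copy worked, the folder can be listed and the file printed back from HDFS (hdfs dfs is the non-deprecated form of hadoop dfs):
bin/hdfs dfs -ls /in
bin/hdfs dfs -cat /in/file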
10) Run the example job.
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /in /out
11) Verify through the NameNode web console (http://localhost:50070/dfshealth.jsp) that the /out folder has been created with the expected contents.
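The output can also be checked from the command line; wordcount writes its result to a part file under /out (typically part-r-00000):
bin/hdfs dfs -ls /out
bin/hdfs dfs -cat /out/part-r-00000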

12) Stop the daemons once the job has completed successfully.
sbin/hadoop-daemon.sh stop namenode
sbin/hadoop-daemon.sh stop datanode
sbin/hadoop-daemon.sh stop secondarynamenode
sbin/yarn-daemon.sh stop resourcemanager
sbin/yarn-daemon.sh stop nodemanager
sbin/mr-jobhistory-daemon.sh stop historyserver

21 comments:

  1. Thanks for a good post.

    Anyway, I'm running into a problem and was wondering if it's something you've encountered. My job history server works fine as long as I have set dfs.permissions to false like you have in your post. However, if I remove it, the job history server says "job not found" when I click on the job history link.

    Looks like the problem is that the job files are written to hdfs with hdfs as the owner, but the job history server is run as yarn user, which causes permission issues.

    Replies
    1. I am not sure about the problem. I suggest you post the problem in the Apache Hadoop forums or on StackOverflow.

  2. This comment has been removed by the author.

  3. While starting the nodemanager I got this error... please help, sir:

    Unrecognized option: -jvm
    Could not create the Java virtual machine.

    How do I solve this?

  4. I did the complete deployment as mentioned, on Ubuntu 12.04 LTS. I am getting the following error when I try to run any MapReduce application using the hadoop-mapreduce-examples-0.23.6.jar:

    Hadoop version: 0.23.6

    Container launch failed for container_1364342550899_0001_01_000002 : java.lang.IllegalStateException: Invalid shuffle port number -1 returned for attempt_1364342550899_0001_m_000000_0

    Any help is greatly appreciated...

    Replies
    1. Raja,

      You can fix the "invalid shuffle port number -1" problem by adding
      <property>
          <name>yarn.nodemanager.aux-services</name>
          <value>mapreduce.shuffle</value>
      </property>
      to your yarn-site.xml

  5. This comment has been removed by the author.

  6. Praveen,

    Great article. I was able to run it using the steps you mentioned. As you said at the start of your post, the whys behind the steps are not explained, but I was able to install Hadoop in pseudo-distributed mode using your steps.

    I look forward to your next post explaining what each step means and why it's necessary. Until then, I would like to share the following warning message I got when I submitted the example job to Hadoop. Can you please advise why I got this warning message? I am using the Hadoop 2.2 GA.

    bin/hadoop dfs -copyFromLocal in/ /in
    DEPRECATED: Use of this script to execute hdfs command is deprecated.
    Instead use the hdfs command for it.

    13/11/24 22:54:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


    Replies
    1. Thanks Jack.

      It's a warning which can be ignored; more details about it are here - http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/NativeLibraries.html.

      Here are some resources to get started with YARN and to know what is what in this blog - http://www.thecloudavenue.com/p/mrv2resources.html.

  7. Hi Praveen,
    I have set up Hadoop 2.2.0 as per the article and executed the wordcount example. I am facing the following issue. Can you explain what might have gone wrong?

    bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar wordcount /in /out
    13/11/30 07:58:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    13/11/30 07:58:38 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    13/11/30 07:58:40 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
    13/11/30 07:58:41 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
    13/11/30 07:58:42 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
    13/11/30 07:58:43 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
    13/11/30 07:58:44 INFO ipc.Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)

    Replies
    1. `yarn.resourcemanager.address` defaults to `0.0.0.0:8032`. So, it looks like the resource manager is not running for some reason. Check the resource manager log file for any error exception.

    2. Please can you help me? I got the same error and I don't know how to fix it. Where will I find the ResourceManager log file?

  8. Sripati...
    I see the same issue, but I don't see any errors in resource manager logs..

    Krishna..

    Are you able to resolve this issue?

    Thanks

    Chandra

  9. Thank you so much. Your steps worked like a charm. You saved my day :)

  10. Thanks Praveen, your steps worked perfectly... thanks a lot.

  11. Please help me solve the following error:

    hduser@ubuntu:/usr/local/hadoop$ hadoop fs -ls
    Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/native/libhadoop.so.1.0.0 which might have disabled stack guard. The VM will try to fix the stack guard now.
    It's highly recommended that you fix the library with 'execstack -c ', or link it with '-z noexecstack'.
    14/03/08 09:27:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    ls: Call From ubuntu/127.0.1.1 to localhost:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused

    Regards
    Abdul

  12. May I know how to edit the file system directory in the localhost:50070 console?

  13. Hi,
    I am trying to debug in standalone mode with the following configuration:

    <property>
       <name>fs.default.name</name>
       <value>file:///</value>
    </property>
    <property>
       <name>mapred.job.tracker</name>
       <value>local</value>
    </property>

    but I am getting the following errors:
    4/06/04 13:55:08 INFO mapreduce.Job: Job job_1401791587704_0007 failed with state FAILED due to: Application application_1401791587704_0007 failed 2 times due to AM Container for appattempt_1401791587704_0007_000002 exited with exitCode: -1000 due to: File file:/user/hdfs/.staging/job_1401791587704_0007/job.jar does not exist
    .Failing this attempt.. Failing the application.
    14/06/04 13:55:08 INFO mapreduce.Job: Counters: 0
    Exception in thread "main" java.io.IOException: Job failed!
    bash-4.1$ ls -al /user/hdfs/.staging/job_1401791587704_0007/
    total 108
    drwx------. 2 hdfs hadoop 4096 Jun 4 13:55 .
    drwx------. 6 hdfs hadoop 4096 Jun 4 13:55 ..
    -rw-r--r--. 1 hdfs hadoop 7767 Jun 4 13:55 job.jar
    -rw-r--r--. 1 hdfs hadoop 72 Jun 4 13:55 .job.jar.crc
    -rw-r--r--. 1 hdfs hadoop 157 Jun 4 13:55 job.split
    -rw-r--r--. 1 hdfs hadoop 12 Jun 4 13:55 .job.split.crc
    -rw-r--r--. 1 hdfs hadoop 42 Jun 4 13:55 job.splitmetainfo
    -rw-r--r--. 1 hdfs hadoop 12 Jun 4 13:55 .job.splitmetainfo.crc
    -rw-r--r--. 1 hdfs hadoop 67865 Jun 4 13:55 job.xml
    -rw-r--r--. 1 hdfs hadoop 540 Jun 4 13:55 .job.xml.crc
    bash-4.1$ whoami
    hdfs

    What am I missing here? Is it possible to run standalone mode with Hadoop 2.2.0 YARN?

  14. bin/hadoop dfs -copyFromLocal in/ /in

    From where should I run the above command: the root directory, the home directory, or the bin directory?

  15. Really nice post. Very helpful.
    In addition to what you included in your post, I would like to share some free resources on Hadoop: http://intellipaat.com/blog/setting-up-hadoop-single-node-setup/
