Wednesday, January 11, 2012

Getting started with NextGen MapReduce (cluster) in easy steps

An earlier blog entry covered setting up MRv2 (NextGen MapReduce) on a single node in easy steps. Now, we will go through the steps for configuring MRv2 on a cluster of nodes. In this configuration, one of the nodes acts as the master (let's call it master) and the rest of the nodes act as slaves (slave1, slave2, .....). The master runs the ResourceManager and the NameNode daemons, while the slaves run the NodeManager and the DataNode daemons. Unlike classic MR, there are no JobTracker and TaskTracker daemons in MRv2.


Note that once a Hadoop environment has been created successfully on a single slave, an image of it can be created and deployed on the remaining slaves easily.

The below instructions have been tried on Ubuntu 11.10 and should apply, more or less, to other operating systems as well. The steps that are unique to setting up NextGen MapReduce on a cluster (compared to the single-node setup) are marked with *.

1) Download the Hadoop 0.23 (NextGen MapReduce) release from the Apache Hadoop releases page.

2) Extract it to a folder (let's call it HADOOP_HOME) on the master and on all the slaves.
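For example, assuming the release tarball has already been downloaded and /home/praveensripati/Installations is the target folder (the file name and the path are just the ones used in this post; adjust them to your setup), run the below on each node.
mkdir -p /home/praveensripati/Installations
tar xzf hadoop-0.23.0.tar.gz -C /home/praveensripati/Installations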

*3) Add the associations between the hostnames and the IP addresses of the master and the slaves to the /etc/hosts file on all the nodes. Make sure that all the nodes in the cluster are able to ping each other.
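As an illustration, with one master and two slaves the entries on every node might look like the below (the IP addresses are placeholders; use the actual addresses of your machines).
192.168.1.100 master
192.168.1.101 slave1
192.168.1.102 slave2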

*4) Make sure that the master is able to do a password-less ssh to all the slaves. Here are the instructions for the same.
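One way to set this up from the master is sketched below, assuming the same user name exists on all the nodes (the key type and the empty passphrase are choices, not requirements). The last command should print the slave's host name without asking for a password.
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id slave1
ssh-copy-id slave2
ssh slave1 hostname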

5) Add the following to .bashrc in the home folder for the master and all the slaves.
export HADOOP_HOME=/home/praveensripati/Installations/hadoop-0.23.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export YARN_CONF_DIR=$HADOOP_HOME/conf
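The new variables take effect only in new shells; to pick them up in the current session, run the below on each node.
source ~/.bashrc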
*6) For some reason the properties set in .bashrc are not being picked up by the Hadoop scripts, so they need to be explicitly set at the beginning of the scripts.

Add the below in libexec/hadoop-config.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_27
Add the below in conf/yarn-env.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_27
export HADOOP_CONF_DIR=/home/praveensripati/Installations/hadoop-0.23.0/conf
export HADOOP_COMMON_HOME=/home/praveensripati/Installations/hadoop-0.23.0
export HADOOP_HDFS_HOME=/home/praveensripati/Installations/hadoop-0.23.0
7) Create a tmp folder in the HADOOP_HOME folder on the master and on all the slaves.
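For example, the below can be run from the master once the password-less ssh from step 4 is in place (the full path is used for the slaves since the remote shell may not have HADOOP_HOME set).
mkdir $HADOOP_HOME/tmp
ssh slave1 mkdir /home/praveensripati/Installations/hadoop-0.23.0/tmp
ssh slave2 mkdir /home/praveensripati/Installations/hadoop-0.23.0/tmp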

*8) Create the following configuration files in the $HADOOP_HOME/conf folder on the master and on all the slaves (a sketch for copying them from the master to the slaves follows the listings below). Note the additional parameters in yarn-site.xml that tell the NodeManagers where to find the ResourceManager. Also, set the dfs.replication parameter to match the number of slave nodes.

core-site.xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://master:9000</value>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/praveensripati/Installations/hadoop-0.23.0/tmp</value>
    </property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
</configuration>
mapred-site.xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
yarn-site.xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce.shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>master:8025</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>master:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>master:8040</value>
    </property>
</configuration>
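One way to keep the configuration identical everywhere is to edit the files on the master and then copy them to the slaves, for example (assuming the same installation path on all the nodes).
scp $HADOOP_HOME/conf/*.xml slave1:/home/praveensripati/Installations/hadoop-0.23.0/conf/
scp $HADOOP_HOME/conf/*.xml slave2:/home/praveensripati/Installations/hadoop-0.23.0/conf/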
*9) Add the host names of the slaves to the conf/slaves file on the master.
slave1
slave2
slave3
10) Format the NameNode on the master
bin/hadoop namenode -format
*11) Start the Hadoop daemons. Note that for the DataNode and the NodeManager, the script is *daemons.sh and not *daemon.sh. The *daemons.sh scripts go through the `conf/slaves` file and start the daemons on all the slaves using the password-less ssh.
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemons.sh start datanode
bin/yarn-daemon.sh start resourcemanager
bin/yarn-daemons.sh start nodemanager
bin/yarn-daemon.sh start historyserver
*12) Time to check whether the installation has been successful.

    a) Check the log files in the $HADOOP_HOME/logs folder for any errors on the master and the slaves.

    b) The following consoles should come up

        http://master:50070/dfshealth.jsp (the # of DataNodes reported should be correct in http://master:50070/dfsnodelist.jsp?whatNodes=LIVE)
        http://master:8088/cluster (the # of NodeManagers reported should be correct in http://master:8088/cluster/nodes)
        http://master:19888/jobhistory (for Job History Server)

    c) Run the jps command to make sure that the daemons are running.

    On the master the output of jps should be similar to the below
2234 Jps
1989 ResourceManager
2023 NameNode
2060 JobHistoryServer

    On the slaves the output of jps should be similar to the below
2234 Jps
2023 NodeManager
1856 DataNode
13) Run the example job.
bin/hadoop jar hadoop-mapreduce-examples-0.23.0.jar randomwriter -Dmapreduce.randomwriter.mapsperhost=1 -Dmapreduce.job.user.name=$USER -Dmapreduce.randomwriter.bytespermap=10000 -Ddfs.blocksize=536870912 -Ddfs.block.size=536870912 -libjars modules/hadoop-mapreduce-client-app-0.23.0.jar output
14) Verify that the output folder with the proper contents has been created through the NameNode Web console (http://master:50070/nn_browsedfscontent.jsp).
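The same check can also be done from the command line on the master.
bin/hadoop fs -ls output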

*15) Stop the daemons once the job has completed successfully.
bin/yarn-daemon.sh stop historyserver
bin/yarn-daemons.sh stop nodemanager
bin/yarn-daemon.sh stop resourcemanager
sbin/hadoop-daemons.sh stop datanode
sbin/hadoop-daemon.sh stop namenode 

6 comments:

  1. Thanks for the nice post! It saved my day.

    Btw, for Hadoop 0.23.5, the script for running historyserver has been changed to:

    sbin/mr-jobhistory-daemon.sh start historyserver

  2. This is a great post, thanks. I have been going off the CDH5 docs and there are a couple of things that they neglect to mention, which this supplements nicely. Notably, the fact that you are running the NameNode and ResourceManager on a single node with the rest acting as slaves.
    Thanks.

  3. Thanks for the post. Can you show how you configured the /etc/hosts file? I get some dns errors because my slaves' hostnames are "localhost".

    Replies
    1. vm4learning@vm4learning:~$ cat /etc/hosts
      127.0.0.1 localhost
      #127.0.1.1 vm4learning
      10.0.2.15 vm4learning

      # The following lines are desirable for IPv6 capable hosts
      ::1 ip6-localhost ip6-loopback
      fe00::0 ip6-localnet
      ff00::0 ip6-mcastprefix
      ff02::1 ip6-allnodes
      ff02::2 ip6-allrouters

  4. Thanks for a good post.
    I do not find the conf folder in the extracted file.
    The yarn-env.sh is found in the etc/hadoop folder.
    Should I create the conf folder and then copy the yarn-env.sh to the conf folder?

    Thanks.

  5. Hi Pravin,

    I think you do not need to create a conf folder separately; I made my changes in the etc/hadoop folder only and it is working just fine.

    Thanks
