Wednesday, January 11, 2012

Getting started with HDFS Federation

HDFS Federation enables multiple NameNodes in a cluster for horizontal scalability of the namespace. All these NameNodes work independently and don't require any coordination among themselves. A DataNode can register with multiple NameNodes in the cluster and store data blocks for all of them.

Here are the instructions to set up HDFS Federation on a cluster.

1) Download the Hadoop 0.23 release from here.

2) Extract it to a folder (let's call it HADOOP_HOME) on all the NameNodes (namenode1, namenode2) and on all the slaves (slave1, slave2, slave3).
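For example, assuming the downloaded tarball is named hadoop-0.23.0.tar.gz (the exact file name may differ), the extraction could look like this:
mkdir -p /home/praveensripati/Installations
tar -xzf hadoop-0.23.0.tar.gz -C /home/praveensripati/Installations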

3) Add the mapping between the hostnames and the IP addresses of the NameNodes and the slaves to the /etc/hosts file on all the nodes. Make sure that all the nodes in the cluster are able to ping each other.
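As an illustration, the entries in /etc/hosts could look like the below (the IP addresses here are just placeholders; use the actual addresses of the nodes):
192.168.1.101   namenode1
192.168.1.102   namenode2
192.168.1.111   slave1
192.168.1.112   slave2
192.168.1.113   slave3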

4) Make sure that the NameNodes are able to do a password-less ssh to all the slaves. Here are the instructions for the same.
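In case the link is not handy, a typical way to set this up from each NameNode looks something like the below (assuming the same user account on all the nodes):
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
ssh-copy-id slave1
ssh-copy-id slave2
ssh-copy-id slave3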

5) Add the following to the .bashrc in the home folder on the NameNodes and on all the slaves.
export HADOOP_DEV_HOME=/home/praveensripati/Installations/hadoop-0.23.0
export HADOOP_COMMON_HOME=$HADOOP_DEV_HOME
export HADOOP_HDFS_HOME=$HADOOP_DEV_HOME
export HADOOP_CONF_DIR=$HADOOP_DEV_HOME/conf
6) For some reason the properties set in .bashrc are not being picked up by the Hadoop scripts and need to be explicitly set at the beginning of the scripts.

Add the following in libexec/hadoop-config.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_27
7) Create a tmp folder in the HADOOP_HOME folder on the NameNodes and on all the slaves.
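For example, from namenode1 (assuming the password-less ssh from step 4 is in place):
mkdir -p /home/praveensripati/Installations/hadoop-0.23.0/tmp
for host in namenode2 slave1 slave2 slave3; do
    ssh $host "mkdir -p /home/praveensripati/Installations/hadoop-0.23.0/tmp"
done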

8) Create the following configuration files in the $HADOOP_HOME/conf folder on the NameNodes and on all the slaves.

core-site.xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/praveensripati/Installations/hadoop-0.23.0/tmp</value>
    </property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.federation.nameservices</name>
        <value>ns1,ns2</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns1</name>
        <value>namenode1:9001</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns2</name>
        <value>namenode2:9001</value>
    </property>
</configuration>
9) Add the slave host names to the conf/slaves file on the NameNode on which the command in step 10 will be run.
slave1
slave2
slave3
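Note that on a brand new cluster the NameNodes typically have to be formatted before the daemons are started for the first time. With federation, all the NameNodes should be formatted with the same cluster id; a sketch of the commands (the cluster id value below is just an example) is:
bin/hdfs namenode -format -clusterId mycluster    # run on namenode1
bin/hdfs namenode -format -clusterId mycluster    # run on namenode2, with the same cluster id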
10) Start the Hadoop daemons. This command will start the NameNodes on the nodes specified in the hdfs-site.xml and also start the DataNodes on the nodes specified in the conf/slaves file.
sbin/start-dfs.sh
11) A NameNode should now be running on namenode1 and namenode2, and a DataNode on each of the slaves.
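One quick way to check this is to run jps (which ships with the JDK) on the different nodes:
jps
The NameNode process should show up in the output on namenode1 and namenode2, and the DataNode process on the slaves.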

12) Check the log files on all the NameNodes and on the slaves for any errors.
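The logs usually end up under the logs folder of the Hadoop installation, with the daemon name in the file name, so a quick scan could look something like this:
grep -i "error\|exception" $HADOOP_DEV_HOME/logs/hadoop-*-namenode-*.log
grep -i "error\|exception" $HADOOP_DEV_HOME/logs/hadoop-*-datanode-*.log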

13) Check the home page of each NameNode. The number of DataNodes should be equal to the number of slaves.
http://namenode1:50070/dfshealth.jsp
http://namenode2:50070/dfshealth.jsp
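The same report is also available from the command line; since there are multiple NameNodes, the file system has to be specified explicitly (the port matches the rpc-address settings in hdfs-site.xml):
bin/hdfs dfsadmin -fs hdfs://namenode1:9001 -report
bin/hdfs dfsadmin -fs hdfs://namenode2:9001 -report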
14) Stop the Hadoop daemons.
sbin/stop-dfs.sh

Once multiple NameNodes are up and running, client-side mount tables can be used to give applications a unified view of the different namespaces. Getting started with mount tables coming soon :)
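As a quick teaser, a minimal client-side mount table in core-site.xml could look something like the below (the mount points /home and /data are made-up examples):
<property>
    <name>fs.defaultFS</name>
    <value>viewfs:///</value>
</property>
<property>
    <name>fs.viewfs.mounttable.default.link./home</name>
    <value>hdfs://namenode1:9001/home</value>
</property>
<property>
    <name>fs.viewfs.mounttable.default.link./data</name>
    <value>hdfs://namenode2:9001/data</value>
</property>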

1 comment:

  1. Praveen,

    I have a couple of questions.

    1. When a client interacts with the cluster, which NameNode will respond? How does the mechanism work?
    2. Will the metadata be stored on the NameNode or on the DataNode?
    3. Can I say that the NameNode now has a backup? Is that a correct statement?

    Regards,
    AK.
