Tuesday, January 3, 2012

WhatsWrong : DataNode on remote machine not able to connect to NameNode


I have set up a Hadoop cluster with `NameNode + DataNode` on one node and a `DataNode` on a different node, with the configuration files below on both nodes.

core-site.xml

<?xml version="1.0"?>
<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<configuration>
     <property>
         <name>dfs.replication</name>
         <value>2</value>
     </property>
</configuration>

The DataNode on the remote machine is not able to connect to the NameNode. Here is the error in the hadoop-praveensripati-datanode-Node2.log file on Node2, where Node1 is the hostname of the node running the NameNode.

2012-01-03 16:57:57,924 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: Node1/192.168.0.101:9000. Already tried 0 time(s).
2012-01-03 16:57:58,926 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: Node1/192.168.0.101:9000. Already tried 1 time(s).
2012-01-03 16:57:59,928 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: Node1/192.168.0.101:9000. Already tried 2 time(s).

I have verified the following:

- Both nodes can ping each other.
- I can ssh from the master to the slave.
- `/etc/hosts` and `/etc/hostname` are configured properly.
- `netstat -a | grep 9000` gives the output below.

tcp        0      0 localhost:9000          *:*                     LISTEN     
tcp        0      0 localhost:9000          localhost:33476         ESTABLISHED
tcp        0      0 localhost:33571         localhost:9000          TIME_WAIT  
tcp        0      0 localhost:33476         localhost:9000          ESTABLISHED

What's wrong with the above setup?

Respond back in the comments and I will give a detailed explanation once I get a proper response.

8 comments:

  1. I think fs.default.name should point to the node1 IP address instead of localhost.

  2. As to why, the source code looks roughly like the following when fs.default.name is set to localhost:

    ServerSocket socket = new ServerSocket();
    socket.bind(new InetSocketAddress("localhost", 9000));

    Because the bind address is localhost, the NameNode process can only accept connections from localhost. If the bind address is the machine's hostname or IP address, the NameNode process can accept connections from remote machines.

  3. I tried to set up a cluster with the NameNode on one node and 3 DataNodes. I set fs.default.name to localhost on the NameNode and to the master IP address on the DataNodes, but I am getting the same error. Can anyone please help me?

  4. > i set fs.default.name to localhost on namenode

    Set it to the hostname/IP, as mentioned in the first comment, instead of localhost.

  5. Yes, I replaced it and it worked, thank you.

  6. Hi, would this also be the case for mapred-site.xml? That is, would putting localhost:9001 in mapred-site.xml also cause bind problems for a datanode trying to connect to the namenode?

    1. Chris,

      The same would be applicable for the mapred-site.xml also. If localhost:9001 is mentioned in the configuration file, then the remote tasktracker won't be able to talk to the jobtracker.

      Praveen

    2. Thanks Praveen, this helped finally get my hadoop cluster up and running

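The diagnosis from the comments can be checked with a small, standalone Java sketch. It is not Hadoop code: the class `LoopbackCheck` and its two methods are hypothetical names made up for this illustration, and it binds to an ephemeral port (0) rather than 9000 so that it does not clash with a running NameNode. The first method checks whether the host in a value such as `hdfs://localhost:9000` resolves to a loopback address. The second binds a plain `ServerSocket` to a given host and reports whether the listening address is loopback-only, which corresponds to the `localhost:9000` (rather than `*:9000`) entry in the `netstat` output above.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.URI;

// Hypothetical helper for illustration only; not part of Hadoop.
class LoopbackCheck {

    // True if the host part of the given URI resolves to a loopback address.
    public static boolean pointsToLoopback(String fsUri) throws Exception {
        String host = URI.create(fsUri).getHost();
        return InetAddress.getByName(host).isLoopbackAddress();
    }

    // Binds a server socket to the given host on an ephemeral port (0) and
    // reports whether the address it listens on is loopback-only.
    public static boolean listensOnLoopbackOnly(String host) throws Exception {
        try (ServerSocket socket = new ServerSocket()) {
            socket.bind(new InetSocketAddress(host, 0));
            return socket.getInetAddress().isLoopbackAddress();
        }
    }

    public static void main(String[] args) throws Exception {
        // The value from core-site.xml in this post.
        System.out.println(pointsToLoopback("hdfs://localhost:9000"));
        // The Node1 address from the DataNode log.
        System.out.println(pointsToLoopback("hdfs://192.168.0.101:9000"));
        // A socket bound to localhost is reachable only from the same machine.
        System.out.println(listensOnLoopbackOnly("localhost"));
        // A socket bound to the wildcard address listens on all interfaces.
        System.out.println(listensOnLoopbackOnly("0.0.0.0"));
    }
}
```

A server socket bound to a loopback address is only reachable from the same machine, which is consistent with the remote DataNode retrying its connection to Node1/192.168.0.101:9000 without success. Using the hostname or IP address of Node1 in `fs.default.name`, as suggested in the comments, avoids this.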