Tuesday, January 24, 2012

Is HBase really column oriented?

According to Wikipedia

A column-oriented DBMS is a database management system (DBMS) that stores its content by column rather than by row. This has advantages for data warehouses and library catalogues where aggregates are computed over large numbers of similar data items.
 
Lately, I have been looking at what the column-oriented databases HBase and Cassandra are about, and the pros and cons of each of them. One thing that was very clear is that Cassandra is way simpler to set up than HBase, since Cassandra is self-contained, while HBase depends on HDFS for storage, which is still evolving and a bit complex. Another difference is that Cassandra is decentralized and has no SPOF (Single Point Of Failure), while in HBase the HDFS NameNode and the HBase Master are SPOFs.

Although HBase is known as a column-oriented database (where the data for a column is expected to stay together), in HBase the data for a particular row stays together, while the data for a particular column is spread out and not together.

Let's go into the details. In HBase, the cell data of a table is stored as key/value pairs in HFiles, and the HFiles are stored in HDFS. More details about the data model are present in the Google BigTable paper. Also, Lars George (author of HBase: The Definitive Guide) does a very good job of explaining the storage layout.

Below is one of the key/value pairs stored in the HFile, which represents a cell in a table.

K: row-550/colfam1:50/1309812287166/Put/vlen=3 V: 501

where

- `row key` is `row-550`
- `column family` is `colfam1`
- `column qualifier` (aka column) is `50`
- `time stamp` is `1309812287166`
- `value` stored is `501`

The dump of an HFile (which stores a lot of key/value pairs) looks like the below, with the key/value pairs in the same sorted order

K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
K: row-550/colfam1:51/1309813948222/Put/vlen=2 V: 51
K: row-551/colfam1:30/1309812287200/Put/vlen=2 V: 51
K: row-552/colfam1:31/1309813948256/Put/vlen=2 V: 52
K: row-552/colfam1:49/1309813948280/Put/vlen=2 V: 52
K: row-552/colfam1:51/1309813948290/Put/vlen=2 V: 52

As seen above, the data for a particular row stays together (for example, all the entries starting with K: row-550/), while the data for a particular column is spread out and not together (for example, K: row-550/colfam1:51 and K: row-552/colfam1:51 for the column named 51). Since the columns are spread out, compression algorithms cannot take advantage of the similarities between the data of a particular column.
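For reference, a dump like the one above can be generated with the HFile tool bundled with HBase. Below is a minimal sketch; the table, region and file names in the path are just placeholders and not from an actual cluster.
# -v prints verbose output, -p prints the key/value pairs and -m prints the meta block of the HFile
# Replace the path with an actual HFile path of the table in HDFS
bin/hbase org.apache.hadoop.hbase.io.hfile.HFile -v -p -m -f /hbase/testtable/<region-name>/colfam1/<hfile-name>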

To conclude, although HBase is called a column-oriented database, the data of a particular row sticks together.

Friday, January 20, 2012

Hadoop World 2011 Videos released by Cloudera

Cloudera has released the Hadoop World 2011 videos.

They cover a wide range of topics, from the `features in the upcoming Hadoop releases` to `different use cases of Hadoop` to `Security`. I would recommend watching those videos.

Pre-packaged Hadoop software (HDP vs CDH)

HortonWorks and Cloudera are two of the companies actively working on Hadoop, besides LinkedIn, Facebook and others.

Cloudera has been providing CDH, a well-integrated and tested package of the different Apache frameworks around big data, for a couple of years. Cloudera also provides tools on top of the Apache frameworks (Cloudera Manager) for the management and troubleshooting of Hadoop installations. These tools become very important as the size of a Hadoop cluster grows over time. As of this writing Cloudera has released three versions of its flagship software: CDH1, CDH2 and CDH3. Cloudera plans to include MRv2, the next-generation MR architecture, in CDH4.

HortonWorks, although a bit late, has announced plans for the public release of the Hortonworks Data Platform (HDP) at the end of this quarter. HDP is along similar lines to CDH. HDP1 will include Hadoop 1.0.0 (from the 0.20.205 branch) and HDP2 will be based on the Hadoop 0.23 release (MRv2). HortonWorks will also be packaging management tools similar to Cloudera Manager and others.

Cloudera has partnered with different vendors (Dell, Oracle and others) to package CDH along with the vendors' offerings. Similar announcements can also be expected from HortonWorks. HortonWorks has been actively working with Microsoft on porting Hadoop to Windows Server and Azure.

Getting started with Hadoop on a couple of machines for a POC is easy, but scaling to 10s, 100s or 1000s of machines in a production environment is a challenge and requires in-depth knowledge about Hadoop, best practices, patterns and the ecosystem surrounding it. This is where companies like HortonWorks and Cloudera, which have a lot of Apache committers on their payroll, come into play. Besides the packaged software, both HortonWorks and Cloudera will also be providing commercial support for HDP and CDH.

It would be interesting to see if any other players besides HortonWorks and Cloudera come into play, or if some of the big companies acquire them.

Thursday, January 12, 2012

Getting started with HDFS client side mount table

With HDFS federation it's possible to have multiple NameNodes in an HDFS cluster. While this is good from a NameNode scalability and isolation perspective, multiple namespaces are difficult to manage from a client application perspective. The HDFS client-side mount table makes the multiple namespaces transparent to the client. The ViewFs documentation has more details on how to use the HDFS client-side mount table.

An earlier blog entry detailed how to set up HDFS federation. Let's assume that the two NameNodes have been set up successfully on namenode1 and namenode2.

Let's map
/NN1Home to hdfs://namenode1:9001/home
and
/NN2Home to hdfs://namenode2:9001/home
Add the following to core-site.xml:
<?xml version="1.0"?>
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/praveensripati/Installations/hadoop-0.23.0/tmp</value>
    </property>
    <property>
        <name>fs.default.name</name>
        <value>viewfs:///</value>
    </property>
    <property>
        <name>fs.viewfs.mounttable.default.link./NN1Home</name>
        <value>hdfs://namenode1:9001/home</value>
    </property>
    <property>
        <name>fs.viewfs.mounttable.default.link./NN2Home</name>
        <value>hdfs://namenode2:9001/home</value>
    </property>
</configuration>
Start the cluster with the `sbin/start-dfs.sh` command from the Hadoop Home and make sure the NameNodes and the DataNodes are working properly.

Run the following commands:
bin/hadoop fs -put somefile.txt /NN1Home/input
Make sure that somefile.txt is in the hdfs://namenode1:9001/home/input folder from the NameNode web console.
bin/hadoop fs -put somefile.txt /NN2Home/output
Make sure that somefile.txt is in the hdfs://namenode2:9001/home/output folder from the NameNode web console.
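Listing the root of the view is also a quick sanity check (the exact output format depends on the Hadoop version).
# With fs.default.name set to viewfs:///, the root should list the configured mount points (/NN1Home and /NN2Home)
bin/hadoop fs -ls /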

Wednesday, January 11, 2012

Getting started with HDFS Federation

HDFS Federation enables multiple NameNodes in a cluster for the horizontal scaling of the NameNode. All these NameNodes work independently and don't require any coordination. A DataNode can register with multiple NameNodes in the cluster and can store the data blocks for multiple NameNodes.

Here are the instructions to set up HDFS Federation on a cluster.

1) Download the Hadoop 0.23 release from here.

2) Extract it to a folder (let's call it HADOOP_HOME) on all the NameNodes (namenode1, namenode2) and on all the slaves (slave1, slave2, slave3)

3) Add the associations between the hostnames and the IP addresses of the NameNodes and the slaves to the /etc/hosts file on all the nodes. Make sure that all the nodes in the cluster are able to ping each other.

4) Make sure that the NameNodes are able to do a password-less ssh to all the slaves. Here are the instructions for the same.

5) Add the following to .bashrc in the home folder for the NameNodes and all the slaves.
export HADOOP_DEV_HOME=/home/praveensripati/Installations/hadoop-0.23.0
export HADOOP_COMMON_HOME=$HADOOP_DEV_HOME
export HADOOP_HDFS_HOME=$HADOOP_DEV_HOME
export HADOOP_CONF_DIR=$HADOOP_DEV_HOME/conf
6) For some reason the properties set in .bashrc are not being picked up by the Hadoop scripts and need to be set explicitly at the beginning of the scripts.

Add the below to libexec/hadoop-config.sh:
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_27
7) Create a tmp folder in the HADOOP_HOME folder on the NameNodes and on all the slaves.

8) Create the following configuration files in the $HADOOP_HOME/conf folder on the NameNodes and on all the slaves.

core-site.xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/praveensripati/Installations/hadoop-0.23.0/tmp</value>
    </property>
</configuration>
hdfs-site.xml
<?xml version="1.0"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.federation.nameservices</name>
        <value>ns1,ns2</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns1</name>
        <value>namenode1:9001</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns2</name>
        <value>namenode2:9001</value>
    </property>
</configuration>
9) Add the slave host names to the conf/slaves file on the NameNode on which the command in step 10 will be run.
slave1
slave2
slave3
10) Start the Hadoop daemons. This command will start the NameNodes on the nodes specified in the hdfs-site.xml and also start the DataNodes on the nodes specified in the conf/slaves file.
sbin/start-dfs.sh
11) The NameNodes should be started on namenode1 and namenode2 and the DataNodes on all the slaves.

12) Check the log files on all the NameNodes and on the slaves for any errors.

13) Check the home page of each NameNode. The number of DataNodes should be equal to the number of slaves.
http://namenode1:50070/dfshealth.jsp
http://namenode2:50070/dfshealth.jsp
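The namespaces can also be checked from the command line by pointing to each NameNode explicitly, since no default file system is configured in this setup. A quick sanity check:
# List the root of each namespace; both NameNodes should respond independently
bin/hadoop fs -ls hdfs://namenode1:9001/
bin/hadoop fs -ls hdfs://namenode2:9001/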
14) Stop the Hadoop daemons.
sbin/stop-dfs.sh

Once multiple NameNodes are up and running, client-side mount tables can be used to give the application a unified view of the different namespaces. Getting started with mount tables coming soon :)

Tuesday, January 10, 2012

How to setup password-less ssh to the slaves?

For setting up Hadoop on a cluster of machines, the master should be able to do a password-less ssh to start the daemons on all the slaves.

Classic MR - the master starts the TaskTracker and the DataNode on all the slaves.

MRv2 (next generation MR) - the master starts the NodeManager and the DataNode on all the slaves.

Here are the steps to set up password-less ssh. Ensure that port 22 is open on all the slaves (`telnet slave-hostname 22` should connect).

1) Install openssh-client on the master
sudo apt-get install openssh-client
2) Install openssh-server on all the slaves
sudo apt-get install openssh-server
3) Generate the ssh key
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
4) Copy the key to all the slaves (replace username with the user that starts the Hadoop daemons). You will be prompted for the password.
ssh-copy-id -i $HOME/.ssh/id_rsa.pub username@slave-hostname
5) If the master also acts as a slave (`ssh localhost` should work without a password), add the key to its own authorized keys:
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 

If hdfs/mapreduce are run as different users, then steps 3, 4 and 5 have to be repeated for all those users.

How to test?

1) Run `ssh user@slave-hostname`. It should connect without prompting for a password.
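To check all the slaves in one go, a small loop like the one below can be used (assuming the slave hostnames are listed in conf/slaves, one per line).
# BatchMode=yes makes ssh fail instead of prompting for a password
for host in $(cat conf/slaves); do
    ssh -o BatchMode=yes $host hostname || echo "password-less ssh to $host is not working"
done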

Monday, January 9, 2012

What is HDFS?

Here is a nice introductory paper on HDFS with a clear explanation of the different features implemented. The paper doesn't go much into the API-level details.

Most of the HDFS features are covered, except for HDFS federation, which was introduced in the 0.23 release, and HDFS High Availability, which will be included in the upcoming 0.24 release.

Also included is the benchmarking done on the Yahoo cluster.

Saturday, January 7, 2012

Hadoop 1.0 Release

There has been a lot of media coverage/hype/attention on the Hadoop 1.0 release announced by Apache. There is nothing much fancy about it: it's just a re-branding of the 0.20.2* release, which has been in production for quite some time. Here is the announcement from HortonWorks.

Without being quite involved in the Hadoop groups, it's very tough to keep track of the different releases, what's going into each of them, and their pros and cons.

0.20.2*, which has been used for quite some time, is rebranded as 1.0, while the release with the next-generation MapReduce is called 0.23. Hadoop should get its version numbering streamlined for wider adoption. There has been a very lengthy discussion about the versioning in the mailing lists, without any conclusion.

Edit (10th January, 2012): Cloudera has clarified the different Hadoop releases and given a bit of history about them.

Thursday, January 5, 2012

Is Hadoop reinventing the wheel?

In Hadoop 0.23, resource management (RM) and application management (AM) have been separated to make Hadoop more scalable. Though Hadoop has got plenty of media attention, it looks like there are a couple of alternatives for RM/AM. Here is an interesting mail from the Hadoop groups.

As mentioned earlier, Hadoop doesn't solve everything. It's very important to have a good idea of the ecosystem before jumping into an architecture or selecting a framework to meet the requirements.

Although some of the frameworks (like Storm) are independent of Hadoop, most of the new frameworks depend on Hadoop for either the MapReduce runtime or the HDFS file system, or implement some of the MR/HDFS APIs in a different way (like MapR). Hadoop acts like a kernel (similar to Linux) on which other frameworks run.

So, if you are getting started with Big Data, look for the different alternative solutions and approaches.

WebEx Session on the major improvements in 0.23 release

HortonWorks is conducting a WebEx session on 12th January, 2012 on the major improvements in the 0.23 release, including the next-generation MapReduce architecture and HDFS federation.

Tuesday, January 3, 2012

WhatsWrong : DataNode on remote machine not able to connect to NameNode


I have set up a Hadoop cluster with `NameNode + DataNode` on one node and `DataNode` on a different node, with the below configuration files on both the nodes.

core-site.xml

<?xml version="1.0"?>
<configuration>
     <property>
         <name>fs.default.name</name>
         <value>hdfs://localhost:9000</value>
     </property>
</configuration>

hdfs-site.xml

<?xml version="1.0"?>
<configuration>
     <property>
         <name>dfs.replication</name>
         <value>2</value>
     </property>
</configuration>

The DataNode on the remote machine is not able to connect to the NameNode. Here is the error in the hadoop-praveensripati-datanode-Node2.log file on Node2, where Node1 is the hostname of the node which has the NameNode.

2012-01-03 16:57:57,924 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: Node1/192.168.0.101:9000. Already tried 0 time(s).
2012-01-03 16:57:58,926 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: Node1/192.168.0.101:9000. Already tried 1 time(s).
2012-01-03 16:57:59,928 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: Node1/192.168.0.101:9000. Already tried 2 time(s).

I made sure that:

- Both the nodes can ping each other.
- Successfully ssh'd from the master to the slave.
- Configured the `/etc/hosts` and `/etc/hostname` properly.
- `netstat -a | grep 9000` gives the below output.

tcp        0      0 localhost:9000          *:*                     LISTEN     
tcp        0      0 localhost:9000          localhost:33476         ESTABLISHED
tcp        0      0 localhost:33571         localhost:9000          TIME_WAIT  
tcp        0      0 localhost:33476         localhost:9000          ESTABLISHED

What's wrong with the above setup?

Respond in the comments and I will give a detailed explanation once I get a proper response.