Tuesday, January 24, 2012

Is HBase really column oriented?

According to Wikipedia:

A column-oriented DBMS is a database management system (DBMS) that stores its content by column rather than by row. This has advantages for data warehouses and library catalogues where aggregates are computed over large numbers of similar data items.

Lately, I have been looking at what the column-oriented databases HBase and Cassandra are about, and at the pros and cons of each. One thing that is very clear is that Cassandra is much simpler to set up than HBase, since Cassandra is self-contained. HBase depends on HDFS for storage, which is still evolving and a bit complex. Another difference is that Cassandra is decentralized and has no SPOF (Single Point Of Failure), while in HBase the HDFS NameNode and the HBase Master are SPOFs.

Although HBase is known as a column-oriented database (where the data of a column is stored together), in HBase the data for a particular row is stored together, while the data of a column is spread across the file.

Let's go into the details. In HBase, the data of each cell in a table is stored as a key/value pair in an HFile, and the HFile is stored in HDFS. More details about the data model can be found in the Google BigTable paper. Also, Lars George (author of HBase: The Definitive Guide) does a very good job of explaining the storage layout.

Below is one of the key/value pairs stored in the HFile, which represents a cell in a table.

K: row-550/colfam1:50/1309812287166/Put/vlen=3 V: 501


`row key` is `row-550`
`column family` is `colfam1`
`column qualifier (aka column)` is `50`
`time stamp` is `1309812287166`
`value` stored is `501`.
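
To make the breakdown above concrete, here is a minimal Python sketch that splits one line of the dump into its parts. It only parses the printed text shown in this post; it does not read a real HFile, and the names `Cell` and `parse_cell` are made up for illustration.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    """Hypothetical container for the parts of one dumped key/value pair."""
    row: str
    family: str
    qualifier: str
    timestamp: int
    op: str
    value: str


def parse_cell(line: str) -> Cell:
    """Parse one 'K: ... V: ...' line of the dump into its parts."""
    key_part, value = line.split(" V: ", 1)
    key = key_part[len("K: "):] if key_part.startswith("K: ") else key_part
    # Split from the right so that a row key containing '/' stays intact.
    row, column, timestamp, op, _vlen = key.rsplit("/", 4)
    family, qualifier = column.split(":", 1)
    return Cell(row, family, qualifier, int(timestamp), op, value.strip())


cell = parse_cell("K: row-550/colfam1:50/1309812287166/Put/vlen=3 V: 501")
print(cell.row, cell.family, cell.qualifier, cell.timestamp, cell.value)
```

Note that the row key is the first component of the key, which is why cells of the same row end up next to each other once the keys are sorted.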

The dump of an HFile (which stores many key/value pairs) looks like the listing below, with the pairs in the order in which they are stored:

K: row-550/colfam1:50/1309812287166/Put/vlen=2 V: 50
K: row-550/colfam1:51/1309813948222/Put/vlen=2 V: 51
K: row-551/colfam1:30/1309812287200/Put/vlen=2 V: 51
K: row-552/colfam1:31/1309813948256/Put/vlen=2 V: 52
K: row-552/colfam1:49/1309813948280/Put/vlen=2 V: 52
K: row-552/colfam1:51/1309813948290/Put/vlen=2 V: 52

As seen above, the data for a particular row stays together (for example, all the keys starting with K: row-550/), while the data for a column is spread out (for example, K: row-550/colfam1:51 and K: row-552/colfam1:51, both for column 51, are separated by cells from other columns). Since the column data is spread out, compression algorithms cannot take advantage of the similarities between the values of a particular column.
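
The observation above can be checked with a small Python sketch. It takes the six cells from the dump, sorts them by (row, column family, column), and reports where the cells of one row and of one column land. The sort key is a simplification assumed for illustration (HBase compares raw bytes and also takes the timestamp into account), and the helper names `positions` and `is_contiguous` are hypothetical.

```python
# (row, column family, column, value) for each cell in the dump above.
cells = [
    ("row-550", "colfam1", "50", "50"),
    ("row-550", "colfam1", "51", "51"),
    ("row-551", "colfam1", "30", "51"),
    ("row-552", "colfam1", "31", "52"),
    ("row-552", "colfam1", "49", "52"),
    ("row-552", "colfam1", "51", "52"),
]


def positions(ordered, field, wanted):
    """Indexes of the cells whose given field equals the wanted value."""
    return [i for i, cell in enumerate(ordered) if cell[field] == wanted]


def is_contiguous(pos):
    """True if the indexes form one unbroken run (or the list is empty)."""
    return not pos or pos == list(range(pos[0], pos[0] + len(pos)))


# Simplified stand-in for the storage order: row first, then family, then column.
hfile_order = sorted(cells, key=lambda c: (c[0], c[1], c[2]))

row_pos = positions(hfile_order, 0, "row-552")
col_pos = positions(hfile_order, 2, "51")

print("row-552 at", row_pos, "contiguous:", is_contiguous(row_pos))  # [3, 4, 5] True
print("column 51 at", col_pos, "contiguous:", is_contiguous(col_pos))  # [1, 5] False
```

Because the row key leads the sort key, every row forms one unbroken run, while the cells of column 51 sit four positions apart.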

To conclude, although HBase is called a column-oriented database, the data of a particular row sticks together.