Saturday, November 23, 2013

Impala or Hive - when to use what?

Hive was initially developed by Facebook and later contributed to the Apache Software Foundation. Here is a paper from Facebook on the same. Impala from Cloudera is based on the Google Dremel paper. Both Impala and Hive provide a SQL-like abstraction for analytics on data stored in HDFS, and both use the Hive metastore. So, when to use Hive and when to use Impala?

Here is a discussion on Quora on the same. Here is a snippet from the Cloudera Impala FAQ:

Impala is well-suited to executing SQL queries for interactive exploratory analytics on large datasets. Hive and MapReduce are appropriate for very long running, batch-oriented tasks such as ETL.
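As a rough illustration of that rule of thumb, here is a minimal Python sketch. The `Workload` class, the `suggest_engine` function and the one-hour cutoff are all made up for this example; they are not part of Impala, Hive or the Cloudera FAQ, and a real decision would weigh more than these two inputs.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a query workload (illustration only)."""
    interactive: bool           # is a person waiting on the result?
    expected_runtime_s: float   # rough guess at how long the job runs


# Assumed cutoff for "very long running"; not a number from the FAQ.
LONG_RUNNING_S = 3600


def suggest_engine(w: Workload) -> str:
    """Encode the FAQ's rule of thumb: interactive exploration -> Impala,
    long-running batch work such as ETL -> Hive."""
    if w.interactive and w.expected_runtime_s < LONG_RUNNING_S:
        return "impala"
    return "hive"


if __name__ == "__main__":
    print(suggest_engine(Workload(interactive=True, expected_runtime_s=30)))
    print(suggest_engine(Workload(interactive=False, expected_runtime_s=4 * 3600)))
```

The sketch only captures the latency side of the choice; other requirements can easily outweigh it.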

And here is a nice presentation that gives a to-the-point summary of Hive vs. Impala, so I won't repeat it in this blog.



Note that performance is not the only non-functional requirement for picking a particular framework. Also, the Big Data space has been moving rapidly, and the comparison results might tip the other way in the future as more improvements are made to the respective frameworks.

4 comments:

  1. I have a quick doubt here. Can we install Impala on an Apache Hadoop distribution? I am using Hadoop 1.0.4 and Hive 0.9. I saw people saying that Impala works only with CDH or Hadoop 2.0. Is this true? Thanks for the post.

    1. I think it is possible; I am actually working on it. There is an Impala user group:

      Impala user list:
      "impala-user@cloudera.org"

  2. Yes, it's a Cloudera product. You can check out Phoenix, which has features similar to Impala's. Also, there are plenty of projects implementing SQL for Hadoop.

    1. Hi!
      I don't think it is impossible. From the description of the product, you just need an Impala daemon on each host and to keep them synchronized. Anyway, I am actually working on it and will get back with the conclusion.
