Monday, December 24, 2012

Introduction to Apache Hive and Pig

Apache Hive is a framework that sits on top of Hadoop for running ad-hoc queries on data stored in Hadoop. Hive supports HiveQL, which is similar to SQL but doesn't support the full set of SQL constructs.

Hive converts the HiveQL query into a Java MapReduce program and then submits it to the Hadoop cluster. The same outcome can be achieved with either HiveQL or Java MapReduce, but Java MapReduce requires a lot more code to be written and debugged than HiveQL. So, using Hive increases developer productivity.

To summarize, Hive, through the HiveQL language, provides a higher-level abstraction over Java MapReduce programming. As with any other high-level abstraction, there is a bit of performance overhead in using HiveQL compared to Java MapReduce, but the Hive community is working to narrow this gap for the most commonly used scenarios.
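As a small sketch of this brevity, the following HiveQL query computes a per-URL request count. The table and column names (`access_logs`, `url`, `status`) are hypothetical, assumed only for illustration; in Java MapReduce the same aggregation would take a mapper, a reducer, and driver code.

```sql
-- Assumes a hypothetical access_logs table already exists, e.g.:
-- CREATE TABLE access_logs (ip STRING, url STRING, status INT)
--   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- Count successful requests per URL; Hive compiles this
-- into one or more MapReduce jobs behind the scenes.
SELECT url, COUNT(*) AS hits
FROM access_logs
WHERE status = 200
GROUP BY url
ORDER BY hits DESC;
```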

Along the same lines, Pig provides a higher-level abstraction over MapReduce. Pig supports PigLatin constructs, which are converted into a Java MapReduce program and then submitted to the Hadoop cluster.
While HiveQL is a declarative language like SQL, PigLatin is a data flow language: the output of one PigLatin construct can be sent as input to the next, and so on.
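The data flow style can be sketched as below, with each statement's relation feeding the next. The file and field names are hypothetical, chosen to mirror the Hive example above.

```pig
-- Hypothetical input file and field names; each statement
-- consumes the relation produced by the previous one.
logs    = LOAD 'access_logs' AS (ip:chararray, url:chararray, status:int);
ok      = FILTER logs BY status == 200;
grouped = GROUP ok BY url;
counts  = FOREACH grouped GENERATE group AS url, COUNT(ok) AS hits;
sorted  = ORDER counts BY hits DESC;
DUMP sorted;
```

This step-by-step flow is what makes PigLatin feel closer to a pipeline than to a single declarative query.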

Some time back, Cloudera published statistics about the workload characteristics of a typical Hadoop cluster, and it can be easily observed that Pig and Hive jobs make up a good portion of the jobs in a Hadoop cluster. Because of the higher developer productivity, many companies are opting for higher-level abstractions like Pig and Hive. So, we can bet there will be a lot more job openings around Hive and Pig than for MapReduce developers.
Although the Programming Pig book was published some time back (October 2011), the Programming Hive book was published only recently (October 2012). For those with experience working with an RDBMS, getting started with Hive would be a better option than getting started with Pig. Also, note that the PigLatin language is not very difficult to get started with.

To the underlying Hadoop cluster it's transparent whether a Java MapReduce job is submitted directly or through Hive or Pig. Because of the batch-oriented nature of MapReduce jobs, jobs submitted through Hive and Pig are also batch oriented in nature.

For real-time response requirements, Hive and Pig don't meet the need, because of the batch-oriented nature of MapReduce jobs mentioned earlier. Cloudera developed Impala, which is based on Dremel (a publication from Google), for interactive ad-hoc queries on top of Hadoop. Impala supports an SQL-like query language and is compatible with HiveQL, so any application built on top of Hive should work with minimal changes on Impala. The major difference between Hive and Impala is that while HiveQL is converted into Java MapReduce jobs, Impala doesn't convert the SQL query into Java MapReduce jobs.

Whether to go with Pig or Hive for a particular requirement is a topic for another blog.


  1. Thanks for the blog!! Nice intro to both Pig and Hive. I just have a small doubt here: there will be some limitations for Hive in handling different types of data. May I know what types of data Pig and Hive handle?

  2. Well, nice post! Hive mostly handles structured data, whereas Pig handles semi-structured data.