Monday, November 25, 2013

Installing/configuring HCatalog and integrating with Pig

As mentioned in the previous blog entry, Hive uses a metastore to store table details, the mapping of each table to its data, and other metadata. Any framework that provides an SQL-like interface needs a metastore similar to Hive's. Instead of having a separate, incompatible metastore for each of these frameworks, it is better to have a common metastore across them. This is where HCatalog comes into the picture.
This way the data structures created by one framework are visible to the others. Also, with a common metastore a data analyst can concentrate more on the analytics than on the location/format of the data.

Using open source is really nice, but the documentation sometimes lacks details, even more so when it comes to integrating different frameworks, and there are also compatibility issues between the frameworks. So, this entry is about how to install/configure HCatalog and how to integrate it with Pig, so that Pig and Hive have a unified metadata view. Anyway, here is the documentation for HCatalog.

HCatalog used to be a separate project under the Apache Software Foundation, but it has been moved into the Hive project and is included in Hive 0.11.0. Here are the steps to install and configure HCatalog. Before following these steps, make sure that Hive uses an external metastore, as mentioned in the earlier blog entry, for Pig to connect to.
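Since Pig reaches the metastore through a thrift:// URI (thrift://localhost:9083 in this entry), a quick sanity check before going further is to see whether anything is listening on that host and port. The snippet below is only a sketch: metastore_host_port and port_is_open are made-up helper names, it assumes bash (for /dev/tcp) and the coreutils timeout command, and an open port only shows that some process is listening, not that it is really the Hive metastore.

```shell
#!/usr/bin/env bash
# Sketch: check whether something is listening on the metastore URI.
# Both function names are hypothetical helpers, not part of Hive or Pig.

# Split a thrift://host:port URI into "host port".
# Assumes the URI always carries an explicit port.
metastore_host_port() {
  local uri="$1"
  local rest="${uri#thrift://}"   # drop the scheme
  local host="${rest%%:*}"        # text before the first colon
  local port="${rest##*:}"        # text after the last colon
  echo "$host $port"
}

# Return 0 if a TCP connection to host:port opens within 2 seconds.
# Relies on bash's /dev/tcp redirection, so plain sh will not work.
port_is_open() {
  local host="$1" port="$2"
  timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# Example usage:
#   read -r host port <<< "$(metastore_host_port thrift://localhost:9083)"
#   port_is_open "$host" "$port" && echo "metastore port reachable" \
#     || echo "nothing listening on $host:$port"
```

If the check fails, the metastore service is most likely not running or is bound to a different port, and the Pig steps further down will not be able to connect.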

- Download Hive (0.11 or later which includes HCatalog) from here and extract it to the $HIVE_HOME folder.

- First, Hive has to be configured with an external metastore for Pig to connect to. Here are more details about how to do that.

- Make sure HDFS and MR are running and then upload the below file to HDFS.
vi /home/vm4learning/Desktop/2001-01-01-GB.txt
bin/hadoop fs -put /home/vm4learning/Desktop/2001-01-01-GB.txt /user/vm4learning/2001-01-01-GB.txt
- Create a table called logs and add partitions to the table pointing to the file uploaded earlier in HDFS. Note that the tables could also be created using the Hive shell, because HCatalog uses the same Hive metastore.

hcatalog/bin/hcat -e "alter table logs add partition (dt='2001-01-01', country='GB') location '/user/vm4learning/'"
- Run some queries in Hive on the logs table to make sure it has been configured properly.

- Update the below property in the $PIG_HOME/conf/
- Register the below jars in the $PIG_HOME/.pigbootup file.
REGISTER /home/vm4learning/Installations/hive-0.11.0-bin/hcatalog/share/hcatalog/hcatalog-core-0.11.0.jar;
REGISTER /home/vm4learning/Installations/hive-0.11.0-bin/lib/hive-exec-0.11.0.jar;
REGISTER /home/vm4learning/Installations/hive-0.11.0-bin/lib/hive-metastore-0.11.0.jar;
- Add the below in the $HOME/.bashrc file or execute them from a terminal before starting grunt. This enables Pig to connect to the HCatalog/Hive metastore.
export PIG_OPTS=-Dhive.metastore.uris=thrift://localhost:9083
export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/*:$HIVE_HOME/lib/*
- Now the logs table created using HCatalog should be visible to Pig. Start grunt and execute the below commands.
A = load 'logs' using org.apache.hcatalog.pig.HCatLoader();
dump A;
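Most of the steps above depend on jar paths being exactly right, so it can help to confirm that the registered jars exist before starting grunt. Here is a minimal sketch of such a check; missing_files is a hypothetical helper name, and the paths in the usage comment are simply the ones from the REGISTER statements in this entry, so adjust them to your own installation.

```shell
#!/usr/bin/env bash
# Sketch: report which of the jars registered in .pigbootup are missing on disk.
# missing_files is a hypothetical helper, not something shipped with Pig or Hive.

# Print every argument that is not an existing regular file.
# Returns 1 if at least one file is missing, 0 otherwise.
missing_files() {
  local status=0 f
  for f in "$@"; do
    if [ ! -f "$f" ]; then
      echo "$f"
      status=1
    fi
  done
  return "$status"
}

# Example usage with the jars from the REGISTER statements:
#   HIVE_DIR=/home/vm4learning/Installations/hive-0.11.0-bin
#   missing_files \
#     "$HIVE_DIR/hcatalog/share/hcatalog/hcatalog-core-0.11.0.jar" \
#     "$HIVE_DIR/lib/hive-exec-0.11.0.jar" \
#     "$HIVE_DIR/lib/hive-metastore-0.11.0.jar" \
#     || echo "fix the paths listed above before starting grunt"
```

The function only checks that the files exist; it does not verify that the jar versions match the Hive and Pig versions in use.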
In an upcoming blog, we will look into how to integrate MR with HCatalog.


  1. 2014-06-23 13:32:39,570 [main] ERROR - ERROR 1070: Could not resolve org.apache.hcatalog.pig.HCatLoader using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    Details at logfile: /home/madhu/pig_1403510533655.log

    1. export PIG_OPTS=-Dhive.metastore.uris=thrift://localhost:9083
      export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-pig-adapter-0.12.0.jar
      export HCAT_HOME=/home/madhu/work/hive-0.12.0/hcatalog

      Let me know whether my path setting is correct or not.

  2. Hi Praveen,
    Thanks for your instructions. There are indeed very few documents about how to integrate Pig with Hive HCatalog. I have been struggling with this for a while.

    Here is my situation. Please help me when you get a chance. I have a Hadoop setup on Amazon AWS. I have set up HiveServer2 and the metastore service in my cluster. I can run the following commands:

    1) hcat -e "use testdb; describe hcatalogtest;"
    2) beeline -u jdbc:hive2://localhost:10000/default -n ubuntu -p ubuntu

    I set up everything according to what you listed. It still failed when I tried to run the Pig command (Pig started with "pig -useHCatalog"):
    A = LOAD 'testdb.hcatalogtest' USING org.apache.hive.hcatalog.pig.HCatLoader();

    The error message is:
    [main] ERROR - ERROR 2245: Cannot get schema from loadFunc org.apache.hive.hcatalog.pig.HCatLoader

    It looks like there is a problem connecting to the Hive metastore server. I checked that my metastore server is up and listening on port 9083.

    If you have any tips for debugging this, please do give me a hand.
    Thanks in advance!