Monday, November 25, 2013

Installing/configuring HCatalog and integrating with Pig

As mentioned in the previous blog entry, Hive uses a metastore to store the table definitions, the mapping of the tables to the underlying data and other details. Any framework which provides an SQL-like interface would require a metastore similar to Hive's. Instead of having a separate, incompatible metastore for each of these frameworks, it would be good to have a common metastore across them. This is where HCatalog comes into the picture.
This way, the data structures created by one framework would be visible to the others. Also, by using a common metastore, data analysts can concentrate more on the analytics than on the location/format of the data.

Using open source is really nice, but sometimes the documentation lacks details, and when it comes to integrating different frameworks it lacks even more; there are also compatibility issues between the different frameworks. So, this entry is about how to install/configure HCatalog and how to integrate it with Pig, so that Pig and Hive have a unified metadata view. Anyway, here is the documentation for HCatalog.

HCatalog used to be a separate project under the Apache Software Foundation, but it has been moved into the Hive project and is included in Hive 0.11.0. Here are the steps to install and configure HCatalog. Before following these steps, make sure that Hive uses an external metastore, as mentioned in the earlier blog entry, for Pig to connect to.

- Download Hive (0.11 or later which includes HCatalog) from here and extract it to the $HIVE_HOME folder.
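For example, assuming the 0.11.0 binary tarball was downloaded to the Installations folder used later in this post, the extraction and HIVE_HOME setup would look something like the below (adjust the file name and paths to your download).
tar xzf hive-0.11.0-bin.tar.gz -C /home/vm4learning/Installations/   # assuming this tarball name
export HIVE_HOME=/home/vm4learning/Installations/hive-0.11.0-bin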

- First, Hive has to be configured with an external metastore, for Pig to connect to it. Here are more details about how to do it.
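As a quick recap (a minimal sketch; see the earlier entry for the complete steps), the metastore URI goes into the $HIVE_HOME/conf/hive-site.xml file
<property>
    <name>hive.metastore.uris</name>
    <value>thrift://localhost:9083</value>
</property>
and the metastore service is started from $HIVE_HOME before using Pig or HCatalog.
bin/hive --service metastore &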

- Make sure HDFS and MR are running and then upload the below file to HDFS.
vi /home/vm4learning/Desktop/2001-01-01-GB.txt
123,log-statement1
124,log-statement2
125,log-statement3
bin/hadoop fs -put /home/vm4learning/Desktop/2001-01-01-GB.txt /user/vm4learning/2001-01-01-GB.txt
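To verify the upload, the file can be read back from HDFS.
bin/hadoop fs -cat /user/vm4learning/2001-01-01-GB.txt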
- Create a table called logs and add a partition to the table pointing to the file uploaded earlier to HDFS. Note that the table could also be created using the Hive shell, because HCatalog uses the same Hive metastore.
hcatalog/bin/hcat -e "CREATE TABLE logs (ts BIGINT, line STRING) PARTITIONED BY (dt STRING, country STRING)  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"

hcatalog/bin/hcat -e "alter table logs add partition (dt='2001-01-01', country='GB') location '/user/vm4learning/'"
- Run some queries in Hive on the logs table to make sure it has been configured properly.
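For example, queries like the below (run from $HIVE_HOME) should return the three log statements uploaded earlier.
bin/hive -e "select * from logs;"
bin/hive -e "select count(*) from logs where dt = '2001-01-01' and country = 'GB';"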

- Update the below property in the $PIG_HOME/conf/pig.properties file.
pig.load.default.statements=$PIG_HOME/.pigbootup
- Register the below jars in the $PIG_HOME/.pigbootup file.
REGISTER /home/vm4learning/Installations/hive-0.11.0-bin/hcatalog/share/hcatalog/hcatalog-core-0.11.0.jar;
REGISTER /home/vm4learning/Installations/hive-0.11.0-bin/lib/hive-exec-0.11.0.jar;
REGISTER /home/vm4learning/Installations/hive-0.11.0-bin/lib/hive-metastore-0.11.0.jar;
- Add the below lines to the $HOME/.bashrc file or execute them from a terminal before starting grunt. This enables Pig to connect to the HCatalog/Hive metastore.
export PIG_OPTS=-Dhive.metastore.uris=thrift://localhost:9083
export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/*:$HIVE_HOME/lib/*
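Note that $HCAT_HOME is not set by default. Since HCatalog ships inside the Hive 0.11 installation, pointing it to the hcatalog folder under $HIVE_HOME should work (adjust the path to your installation).
export HCAT_HOME=$HIVE_HOME/hcatalog   # assuming HCatalog under the Hive 0.11 installation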
- Now the logs table created using HCatalog should be visible to Pig. Start grunt and execute the below commands.
A = load 'logs' using org.apache.hcatalog.pig.HCatLoader();
dump A;
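HCatLoader also exposes the partition columns (dt and country) as regular fields, so the relation can be filtered on them; a small sketch of such a query is below.
describe A;
B = filter A by dt == '2001-01-01';   -- dt/country come from the partition columns
dump B;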
In an upcoming blog, we will look into how to integrate MR with HCatalog.

5 comments:

  1. 2014-06-23 13:32:39,570 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve org.apache.hcatalog.pig.HCatLoader using imports: [, org.apache.pig.builtin., org.apache.pig.impl.builtin.]
    Details at logfile: /home/madhu/pig_1403510533655.log

    Replies
    1. export PIG_OPTS=-Dhive.metastore.uris=thrift://localhost:9083
      export PIG_CLASSPATH=$HCAT_HOME/share/hcatalog/hcatalog-pig-adapter-0.12.0.jar
      export HCAT_HOME=/home/madhu/work/hive-0.12.0/hcatalog


      Let me know whether my path setting is correct or not.

  2. Hi Praveen,
    Thanks for your instructions. There are indeed very few documents about how to make Pig integrated with the Hive HCatalog. I have been struggling with this for a while.

    Here is my situation; please help me out when you get a chance. I have a Hadoop setup at Amazon AWS. I have set up HiveServer2 and the metastore service in my cluster. I could run the following commands:

    1) hcat -e "use testdb; describe hcatalogtest;"
    2) beeline -u jdbc:hive2://localhost:10000/default -n ubuntu -p ubuntu

    I set up everything according to what you listed. I still failed when I tried to run the Pig command (Pig started with "pig -useHCatalog"):
    A = LOAD 'testdb.hcatalogtest' USING org.apache.hive.hcatalog.pig.HCatLoader();

    The error message is:
    [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: Cannot get schema from loadFunc org.apache.hive.hcatalog.pig.HCatLoader

    It looks like there is a problem connecting to the Hive metastore server. I checked that my metastore server is up and port 9083 is listening.

    If you have tips for me to debug, please do give me a hand.
    Thanks in advance!
    Max

  3. Can I install HCatalog without the Hive component?

  4. Hi All,

    Whoever is getting the following error:
    [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: Cannot get schema from loadFunc org.apache.hive.hcatalog.pig.HCatLoader

    You must start a few services before running Pig/Hive/HCatalog.
    The steps mentioned above are correct, but to add to them:
    1. Start hiveserver2
    2. Start metastore

    Then launch pig using "pig -useHCatalog" and you are ready to rock.
