Thursday, November 21, 2013

Different ways of configuring Hive metastore

Apache Hive is a client side library providing a table like abstraction on top of the data in HDFS for data processing. Hive jobs are converted into a MR plan which is then submitted to the Hadoop cluster for execution.

The Hive table definitions and mapping to the data are stored in a metastore. Metastore constitutes of  (1) the meta store service and (2) the database. The metastore service provides the interface to the Hive and the database stores the data definitions, mappings to the data and others.

The metastore (service and database) can be configured in different ways. The default Hive configuration (as is from Apache Hive without any configuration changes) is that Hive driver, metastore interface and the db (derby) all use the same JVM. This configuration is called embedded metastore and is good for the sake of development and unit testing, but won't scale to a production environment as only a single user can connect to the derby database at any instant of time. Starting second instance of the Hive driver will thrown an error back.
Another way to configure is to use an external database which is JDBC compliant like MySQL as shown below. This configuration is called local metastore.
Here are the steps required to configure Hive in an local metastore way.

- Install MySql using apt-get.
sudo apt-get install mysql-server
- Next install the JDBC drivers for MySQL
sudo apt-get install libmysql-java
- Goto to the $HIVE_HOME/lib folder and create a link to the MySQL JDBC library.
ln -s /usr/share/java/mysql-connector-java.jar
- In the conf/hive-site.xml include the following properties to use the above installed MySQL as the metastore database instead of the default Derby database. Modify the properties appropriately to reflect the environment.
- Start the Hive shell and it will automatically connect to the MySQL database and create the required tables in it.

The next way is to configure Hive in the remote metastore way. In this configuration the Hive driver and the metastore interface would be running in a different JVM (which can run on different machines also) as shown below. This way the database can be fire-walled from the Hive user and also database credentials are completely isolated from users of Hive.
Here are the steps to run Hive in the remote metastore configuration.

- Include the below configuration in conf/hive-site.xml
- Start the Hive metastore interface
bin/hive --service metastore &
- The Hive metastore interface by default listens at port 9083. Make sure it is.
netstat -an | grep 9083
- Start the Hive shell and make sure the Hive DDL/DDL operations work.

With the assumption that Hive has been configured in the remote metastore way, we will look into how to install and configure HCatalog.


  1. Gave Good Clarity, Also i got this question :

    if at all, i am planing to configure "Remote Metastore", In that, Metastore & Sql DB on two different servers, Where should i install JDBC CONNECTION LIBRARY, Means, should that connector be installed part of Metastore server or It Must be in Metastore DB (mySQL) server ?

    Best Regards,

    1. Whoever is connecting to JDBC, the driver should be installed there. In the `remote metastore` configuration, the driver should be installed on the `metastore server`.

  2. From you explanation - can i create tables,index... on mysql db by using hive cli on local mode? i think only metadata of the hive tables,index... should store in mysql.

  3. Can I use remote Oracle DB as hive metastore database?