Tuesday, December 17, 2013

Imapala vs Hive performance on Amazon AWS EMR

Apache Hive and Impala provide a SQL layer on top of the data in HDFS. Hive converts the SQL query into a MR plan and submits it to the Hadoop cluster for execution, while the Impala daemons directly process the data in HDFS. In an earlier blog we looked into when to use Hive and when Impala. Also, as mentioned in an earlier blog, both of them can use a common metastore, so any tables created in Hive would be visible in Impala and vice versa.
Hive on AWS EMR had been there for some time, but recently Impala has been added to the list. In this blog, we will compare the performance of Impala vs Hive on Amazon EMR using the same hardware (m1.large for master and m1.large for the slave) on the same data set. Note that a lot of improvements are in progress for both Hive and Impala and also EMR is using the latest versions of Hive and Impala as of now.

So, here are the steps at a high level:

- Launch a EMR cluster with Impala and Hive on AWS.
- ssh to the master and create some data.
- Create some tables using Impala shell and map it to the above data.
- Run the same queries on Hive and Impala to measure the performance.

And here are the steps in detail :

- Create the KeyPairs from the EC2 Management Console and the Access Keys (Access Key ID and Secret Access Key) in the IAM Management Console as mentioned here. While creating the KeyPairs the user will be prompted to store a pem file, which will be used to login to the master.

- Go to the EMR Management console and create a cluster.
- Select the default properties, except for the below.
- From the EC2 Management console, get the hostname of the master and ssh to it as below using the pem file.
ssh -i thecloudavenue-impala.pem hadoop@ec2-50-19-63-105.compute-1.amazonaws.com
- Run the jps command and make sure the below processes are running. The reason why Impala daemons are not shown is because Impala is not developed in Java.
[hadoop@domU-12-31-39-16-6E-66 ~]$ jps
1494 Main
2202 RunJar
3748 Jps
2278 WebAppProxyServer
2346 NameNode
2381 JobClient
2568 ResourceManager
2499 JobHistoryServer
- Create an ssh tunnel to the master.
ssh –i thecloudavenue-emr.pem -ND 8157 hadoop@ec2-54-211-100-209.compute-1.amazonaws.com
Download the FoxyProxy addon and configure it as mentioned here. This will make the Impala and Hadoop console request go through the ssh tunnel to the master and the rest of the requests go as usual.
Statestore console - http://ec2-50-19-63-105.compute-1.amazonaws.com:25000
impalad console - http://ec2-50-19-63-105.compute-1.amazonaws.com:25010
Catalog console - http://ec2-50-19-63-105.compute-1.amazonaws.com:25020
ResourceManager console - http://ec2-50-19-63-105.compute-1.amazonaws.com:9026
HDFS console - http://ec2-50-19-63-105.compute-1.amazonaws.com:9101 
- Download the dbgen jar file and then generate the data for the test.
mkdir test
cd test
wget http://elasticmapreduce.s3.amazonaws.com/samples/impala/dbgen-1.0-jar-with-dependencies.jar
java -cp dbgen-1.0-jar-with-dependencies.jar DBGen -p /tmp/dbgen -b 1 -c 1 -t 1
- The data should appear in the /tmp/dbgen folder
ls -ltr
total 12
drwxr-xr-x 2 hadoop hadoop 4096 Dec 16 11:47 books
drwxr-xr-x 2 hadoop hadoop 4096 Dec 16 11:49 customers
drwxr-xr-x 2 hadoop hadoop 4096 Dec 16 11:50 transactions

- Put the data in HDFS
hadoop fs -mkdir /data/
hadoop fs -put /tmp/dbgen/* /data/
hadoop fs -ls -h -R /data/

- Start the Impala shell using the impala-shell command and create tables and map the above created data to it.
create EXTERNAL TABLE books( id BIGINT, isbn STRING, category STRING, publish_date TIMESTAMP, publisher STRING, price FLOAT ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/books/';
create EXTERNAL TABLE customers( id BIGINT, name STRING, date_of_birth TIMESTAMP, gender STRING, state STRING, email STRING, phone STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/customers/';
create EXTERNAL TABLE transactions( id BIGINT, customer_id BIGINT, book_id BIGINT, quantity INT, transaction_date TIMESTAMP ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/transactions/';

- The table definition and the data mapping from the previous step will be created in the MySQL metastore and should be visible to Hive also. More details about the HCatalog here.

From Impala (impala-shell command)
show tables;
Query: show tables
| name         |
| books        |
| customers    |
| transactions |
Returned 3 row(s) in 0.02s
From Hive (hive command)
show tables;
Time taken: 2.518 seconds, Fetched: 3 row(s)
- Run the below queries from the Impala shell (impala-shell command) and the Hive shell (hive command). Note that the same queries can be run on both Hive and Imapala, except that the [SHUFFLE] keyword has to be removed for Hive.
Query1 :  select COUNT(*) FROM customers WHERE name = 'Harrison SMITH';
Query2 :  select category, count(*) cnt FROM books GROUP BY category ORDER BY cnt DESC LIMIT 10;
Query3 :  select tmp.book_category, ROUND(tmp.revenue, 2) AS revenue FROM ( SELECT books.category AS book_category, SUM(books.price * transactions.quantity) AS revenue FROM books JOIN [SHUFFLE] transactions ON ( transactions.book_id = books.id AND YEAR(transactions.transaction_date) BETWEEN 2008 AND 2010 ) GROUP BY books.category ) tmp ORDER BY revenue DESC LIMIT 10;
Query4 :  select tmp.book_category, ROUND(tmp.revenue, 2) AS revenue FROM ( SELECT books.category AS book_category, SUM(books.price * transactions.quantity) AS revenue FROM books JOIN [SHUFFLE] transactions ON ( transactions.book_id = books.id ) JOIN [SHUFFLE] customers ON ( transactions.customer_id = customers.id AND customers.state IN ('WA', 'CA', 'NY') ) GROUP BY books.category ) tmp ORDER BY revenue DESC LIMIT 10;
If the [SHUFFLE] keyword is not removed for Hive, then the below exception is thrown back.
    at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:2793)
    at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1347)
    at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1191)
    at org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:30787)
    at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:28976)
    at org.apache.hadoop.hive.ql.parse.HiveParser.regular_body(HiveParser.java:28884)
    at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatement(HiveParser.java:28424)
    at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:28218)
    at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:4052)
Here the the response time (in seconds) for the same queries/data/hardware using Impala/Hive. Note that Impala is much faster when compared to Hive. But, there are different scenarios where Impala/Hive meet the requirements (more details here).
Note that for the whole exercise it took 16 normalized hours or 2 hours of 2 large ec2 instances. Check the prices here before launching the EMR cluster and make sure to terminate the cluster once done.
As I had been saying again and again, Amazon makes it easy for the general to get started very easy in terms of not only CAPEX/OPEX costs, but also in terms of the technical skills required. But, it's sad to see Amazon's contribution to the open source community. Here are the patches contributed back to Hive from Amazon.

Finally, here (1, 2, 3) are some interesting articles on the man behind Impala.

1) Here is an article on integrating R with Impala.
2) Here it the article from Amazon on the same.


No comments:

Post a Comment