Apache Hive and Impala provide a SQL layer on top of the data in HDFS. Hive converts the SQL query into an MR plan and submits it to the Hadoop cluster for execution, while the Impala daemons process the data in HDFS directly. In an earlier blog we looked into when to use Hive and when to use Impala. Also, as mentioned in an earlier blog, both of them can use a common metastore, so any tables created in Hive would be visible in Impala and vice versa.
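As a quick illustration of the shared metastore, here is a minimal sketch (the table name t1 is just a placeholder; note that Impala caches table metadata, so it has to be told to reload it before a table created from Hive shows up):

hive -e "CREATE TABLE t1 (id BIGINT, name STRING);"   # create a table from Hive
impala-shell -q "INVALIDATE METADATA"                 # make Impala reload the metastore
impala-shell -q "SHOW TABLES"                         # t1 is now visible from Impala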
Hive has been available on AWS EMR for some time, and recently Impala has been added to the list. In this blog, we will compare the performance of Impala vs Hive on Amazon EMR using the same hardware (m1.large for the master and m1.large for the slave) on the same data set. Note that a lot of improvements are in progress for both Hive and Impala, and EMR is using the latest versions of both as of now.
So, here are the steps at a high level:
- Launch an EMR cluster with Impala and Hive on AWS.
- ssh to the master and create some data.
- Create some tables using the Impala shell and map them to the above data.
- Run the same queries on Hive and Impala to measure the performance.
And here are the steps in detail:
- Create the KeyPairs from the EC2 Management Console and the Access Keys (Access Key ID and Secret Access Key) in the IAM Management Console as mentioned here. While creating the KeyPairs, the user will be prompted to save a pem file, which will be used to log in to the master.
- Go to the EMR Management console and create a cluster.
- Select the default properties, except for the below.
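For those who prefer the CLI over the console wizard, a similar cluster can also be launched with the aws tool. This is only a rough sketch; the cluster name, AMI version, key name and instance counts are placeholders to be adjusted:

# Hypothetical CLI alternative to the console wizard (1 master + 1 slave, both m1.large)
aws emr create-cluster \
    --name "impala-vs-hive" \
    --ami-version 3.1.0 \
    --applications Name=Hive Name=Impala \
    --ec2-attributes KeyName=thecloudavenue-impala \
    --instance-type m1.large \
    --instance-count 2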
- From the EC2 Management console, get the hostname of the master and ssh to it as below using the pem file.
ssh -i thecloudavenue-impala.pem hadoop@ec2-50-19-63-105.compute-1.amazonaws.com
- Run the jps command and make sure the below processes are running. The Impala daemons are not shown because Impala is not developed in Java.
[hadoop@domU-12-31-39-16-6E-66 ~]$ jps
1494 Main
2202 RunJar
3748 Jps
2278 WebAppProxyServer
2159
2346 NameNode
2381 JobClient
2568 ResourceManager
2499 JobHistoryServer
2139
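Since the Impala daemons are not JVM processes, ps can be used instead of jps to make sure they are up (a quick check, assuming the standard daemon names):

# The Impala daemons are C++ processes, so look for them with ps instead of jps
ps -ef | grep -E 'impalad|statestored|catalogd' | grep -v grep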
- Create an ssh tunnel to the master.
ssh -i thecloudavenue-emr.pem -ND 8157 hadoop@ec2-54-211-100-209.compute-1.amazonaws.com
- Download the FoxyProxy addon and configure it as mentioned here. This will make the Impala and Hadoop console requests go through the ssh tunnel to the master, while the rest of the requests go as usual. Now the below consoles can be accessed from the browser.
Statestore console - http://ec2-50-19-63-105.compute-1.amazonaws.com:25000
impalad console - http://ec2-50-19-63-105.compute-1.amazonaws.com:25010
Catalog console - http://ec2-50-19-63-105.compute-1.amazonaws.com:25020
ResourceManager console - http://ec2-50-19-63-105.compute-1.amazonaws.com:9026
HDFS console - http://ec2-50-19-63-105.compute-1.amazonaws.com:9101
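The tunnel itself can be sanity-checked from the local machine with curl by pointing it at the SOCKS proxy created by the -ND option above (a sketch, reusing port 8157 from the tunnel command):

# Fetch the Statestore console page through the SOCKS tunnel
curl --socks5-hostname localhost:8157 http://ec2-50-19-63-105.compute-1.amazonaws.com:25000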
- Download the dbgen jar file and then generate the data for the test (the -b, -c and -t flags set the size, in GB, of the books, customers and transactions data sets respectively).
mkdir test
cd test
wget http://elasticmapreduce.s3.amazonaws.com/samples/impala/dbgen-1.0-jar-with-dependencies.jar
java -cp dbgen-1.0-jar-with-dependencies.jar DBGen -p /tmp/dbgen -b 1 -c 1 -t 1
- The data should appear in the /tmp/dbgen folder.
ls -ltr
total 12
drwxr-xr-x 2 hadoop hadoop 4096 Dec 16 11:47 books
drwxr-xr-x 2 hadoop hadoop 4096 Dec 16 11:49 customers
drwxr-xr-x 2 hadoop hadoop 4096 Dec 16 11:50 transactions
- Put the data in HDFS.
hadoop fs -mkdir /data/
hadoop fs -put /tmp/dbgen/* /data/
hadoop fs -ls -h -R /data/
- Start the Impala shell using the impala-shell command, create the tables and map them to the data created above.
create EXTERNAL TABLE books (
    id BIGINT,
    isbn STRING,
    category STRING,
    publish_date TIMESTAMP,
    publisher STRING,
    price FLOAT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/books/';

create EXTERNAL TABLE customers (
    id BIGINT,
    name STRING,
    date_of_birth TIMESTAMP,
    gender STRING,
    state STRING,
    email STRING,
    phone STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/customers/';

create EXTERNAL TABLE transactions (
    id BIGINT,
    customer_id BIGINT,
    book_id BIGINT,
    quantity INT,
    transaction_date TIMESTAMP
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LOCATION '/data/transactions/';
- The table definitions and the data mappings from the previous step are stored in the MySQL metastore and should be visible to Hive also. More details about HCatalog here.
From Impala (impala-shell command)
show tables;
Query: show tables
+--------------+
| name         |
+--------------+
| books        |
| customers    |
| transactions |
+--------------+
Returned 3 row(s) in 0.02s
From Hive (hive command)
show tables;
OK
books
customers
transactions
Time taken: 2.518 seconds, Fetched: 3 row(s)
- Run the below queries from the Impala shell (impala-shell command) and the Hive shell (hive command). Note that the same queries can be run on both Hive and Impala, except that the [SHUFFLE] hint has to be removed for Hive (see the rewritten Query3 after the list of queries).
Query1:
select COUNT(*) FROM customers WHERE name = 'Harrison SMITH';

Query2:
select category, count(*) cnt FROM books GROUP BY category ORDER BY cnt DESC LIMIT 10;

Query3:
select tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
    SELECT books.category AS book_category, SUM(books.price * transactions.quantity) AS revenue
    FROM books JOIN [SHUFFLE] transactions ON (
        transactions.book_id = books.id
        AND YEAR(transactions.transaction_date) BETWEEN 2008 AND 2010
    )
    GROUP BY books.category
) tmp
ORDER BY revenue DESC LIMIT 10;

Query4:
select tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
    SELECT books.category AS book_category, SUM(books.price * transactions.quantity) AS revenue
    FROM books JOIN [SHUFFLE] transactions ON (
        transactions.book_id = books.id
    ) JOIN [SHUFFLE] customers ON (
        transactions.customer_id = customers.id
        AND customers.state IN ('WA', 'CA', 'NY')
    )
    GROUP BY books.category
) tmp
ORDER BY revenue DESC LIMIT 10;
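For reference, this is how Query3 looks when rewritten for Hive, with only the [SHUFFLE] hints stripped out (nothing else changes):

select tmp.book_category, ROUND(tmp.revenue, 2) AS revenue
FROM (
    SELECT books.category AS book_category, SUM(books.price * transactions.quantity) AS revenue
    FROM books JOIN transactions ON (
        transactions.book_id = books.id
        AND YEAR(transactions.transaction_date) BETWEEN 2008 AND 2010
    )
    GROUP BY books.category
) tmp
ORDER BY revenue DESC LIMIT 10;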
If the [SHUFFLE] hint is not removed for Hive, the below exception is thrown:
NoViableAltException(268@[])
    at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromSource(HiveParser_FromClauseParser.java:2793)
    at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.joinSource(HiveParser_FromClauseParser.java:1347)
    at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.fromClause(HiveParser_FromClauseParser.java:1191)
    at org.apache.hadoop.hive.ql.parse.HiveParser.fromClause(HiveParser.java:30787)
    at org.apache.hadoop.hive.ql.parse.HiveParser.selectStatement(HiveParser.java:28976)
    at org.apache.hadoop.hive.ql.parse.HiveParser.regular_body(HiveParser.java:28884)
    at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatement(HiveParser.java:28424)
    at org.apache.hadoop.hive.ql.parse.HiveParser.queryStatementExpression(HiveParser.java:28218)
    at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.subQuerySource(HiveParser_FromClauseParser.java:4052)
Here is the response time (in seconds) for the same queries/data/hardware using Impala and Hive. Note that Impala is much faster when compared to Hive, but there are different scenarios where each of them meets the requirements (more details here).
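To capture the response times, each query can be wrapped with the time command from the shell; a minimal sketch using Query1 (impala-shell -q and hive -e each run a query non-interactively and exit):

# Time the same query on Impala and on Hive from the master node
time impala-shell -q "select COUNT(*) FROM customers WHERE name = 'Harrison SMITH';"
time hive -e "select COUNT(*) FROM customers WHERE name = 'Harrison SMITH';"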
As I have been saying again and again, Amazon makes it very easy for the general public to get started, not only in terms of CAPEX/OPEX costs, but also in terms of the technical skills required. But, it's sad to see how little Amazon contributes back to the open source community. Here are the patches contributed back to Hive from Amazon.
1) Here is an article on integrating R with Impala.
2) Here is the article from Amazon on the same.