In the previous blog, we executed a Hive script to convert the Airline dataset from the original CSV format to the Parquet Snappy format. The same query was then run against the CSV data and the Parquet Snappy data to compare the performance. This involved three steps.
Step 1 : Create the ontime and the ontime_parquet_snappy tables, then move the data from the ontime table to the ontime_parquet_snappy table to convert it from one format to the other.
Step 2 : Execute the query on the ontime table, which represents the CSV data.
Step 3 : Execute the query on the ontime_parquet_snappy table, which represents the Parquet Snappy data.
The execution times for the above three steps were taken from the AWS EMR management console, which is a Web UI. All the tasks that can be done from the AWS management console can also be done from the CLI (Command Line Interface). Let's see the steps involved in getting the execution time of the steps in EMR.
Step 1 : Install the AWS CLI for the appropriate OS. Here are the instructions for the same.
Step 2 : Generate the Security Credentials. These are used to make calls from the SDK and CLI. More about Security Credentials here and how to generate them here.
Step 3 : Configure the AWS CLI by specifying the Security Credentials and the Region by running the 'aws configure' command. More details here.
Step 4 : From the prompt, run the command below to get the cluster-id of the EMR cluster.
aws emr list-clusters --query 'Clusters[*].{Id:Id}'
Step 5 : For the above cluster-id, get the step-ids by running the command below.
aws emr list-steps --cluster-id j-1WNWN0K81WR11 --query 'Steps[*].{Id:Id}'
Step 6 : For one of the above step-ids, get the start and the end time; the difference between the two is the execution time of the step.
aws emr describe-step --cluster-id j-1WNWN0K81WR11 --step-id s-3CTY1MTJ4IPRP --query 'Step.{StartTime:Status.Timeline.StartDateTime,EndTime:Status.Timeline.EndDateTime}'
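Step 6 returns only the two timestamps, so the execution time still has to be worked out by subtracting one from the other. Here is a minimal sketch of that arithmetic, assuming GNU date (as shipped with Ubuntu) and timestamps that are either ISO 8601 strings or epoch seconds, since the format printed depends on the CLI version and its settings. The function names to_epoch and elapsed_seconds and the sample timestamps are made up for illustration and are not taken from the actual cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of the AWS CLI.

# Convert a timestamp to whole epoch seconds.
# Accepts ISO 8601 strings (e.g. 2016-09-18T10:15:00Z) or
# epoch seconds (e.g. 1474193700.123); fractions are dropped.
to_epoch() {
  case "$1" in
    *[!0-9.]*)
      # Contains something other than digits and dots: let GNU date parse it.
      date -u -d "$1" +%s
      ;;
    *)
      # Already epoch seconds: strip the fractional part.
      printf '%s\n' "${1%%.*}"
      ;;
  esac
}

# Print the number of seconds between a start and an end timestamp.
elapsed_seconds() {
  local start end
  start=$(to_epoch "$1")
  end=$(to_epoch "$2")
  echo $(( end - start ))
}

# Illustrative values only.
elapsed_seconds "2016-09-18T10:15:00Z" "2016-09-18T10:17:30Z"
```

With the illustrative timestamps above, the script prints 150, that is, two and a half minutes. In practice the two arguments would be the StartTime and EndTime values returned by the describe-step command.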
The above commands might look a bit cryptic, but they are easy once you get started. The documentation for the same is here. As you may have noticed, I have created an Ubuntu virtual machine on top of Windows and am executing the commands in Ubuntu.
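Since three steps have to be compared, repeating the describe-step command for each step-id gets tedious. The sketch below wraps the list-steps command in a small function that prints the Id, Name, start time and end time of every step in one call. It assumes that the list-steps output carries the same Status.Timeline fields that describe-step returns, and the function name print_step_times is made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: show the timeline of every step of one cluster.
# Assumes the AWS CLI is installed and configured as in Steps 1 to 3.
print_step_times() {
  local cluster_id="$1"
  aws emr list-steps \
    --cluster-id "$cluster_id" \
    --query 'Steps[*].[Id,Name,Status.Timeline.StartDateTime,Status.Timeline.EndDateTime]' \
    --output table
}

# Usage (needs a real cluster-id, for example the one from Step 4):
#   print_step_times j-1WNWN0K81WR11
```

The --output table option only changes how the result is displayed; dropping it gives the default output format configured for the CLI.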