Tuesday, October 17, 2017

Microsoft Azure for AWS Professionals

A few months back I blogged about 'GCP for AWS Professionals' comparing the two platforms here. Now, Microsoft has published something similar comparing Microsoft Azure with Amazon AWS here.

It speaks well of Amazon AWS when its competitors compare their services against Amazon's. AWS has been adding new services, and features (small and big) within them, at a very rapid pace. Here you can see the new features introduced in Amazon AWS on a daily basis.

Similar to GCP and AWS, Azure also gives free credit to get started. So, now is the time to create an account and get started with the Cloud. Here are the links for the same (AWS, Azure and GCP).

Getting the execution times of the EMR Steps from the CLI

In the previous blog, we executed a Hive script to convert the Airline dataset from the original csv format to the Parquet Snappy format. Then the same query was run against both the csv and the Parquet Snappy data to see the performance improvement. This involved three steps.

Step 1 : Create the ontime and the ontime_parquet_snappy tables. Move the data from the ontime table to the ontime_parquet_snappy table to convert it from one format to the other.

Step 2 : Execute the query on the ontime table, which represents the csv data.

Step 3 : Execute the query on the ontime_parquet_snappy table, which represents the Parquet Snappy data.

The execution times for the above three steps were obtained from the AWS EMR management console, which is a Web UI. All the tasks which can be done from the AWS management console can also be done from the CLI (Command Line Interface). Let's see the steps involved in getting the execution times of the steps in EMR.

Step 1 : Install the AWS CLI for the appropriate OS. Here are the instructions for the same.

Step 2 : Generate the Security Credentials. These are used to make calls from the SDK and CLI. More about Security Credentials here and how to generate them here.

Step 3 : Configure the AWS CLI by specifying the Security Credentials and the Region using the 'aws configure' command. More details here.
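For reference, the prompts from 'aws configure' look like the below. The access key and the secret key shown are the placeholder values from the AWS documentation and the region is just an example; replace them with your own.

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json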

Step 4 : From the prompt execute the below command to get the cluster-id of the EMR cluster.
aws emr list-clusters --query 'Clusters[*].{Id:Id}'
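With the json output format, the command returns an array with one entry per cluster, something like the below.

[
    {
        "Id": "j-1WNWN0K81WR11"
    }
]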

Step 5 : For the above cluster-id get the step-id by executing the below command.
aws emr list-steps --cluster-id j-1WNWN0K81WR11 --query 'Steps[*].{Id:Id}'

Step 6 : For one of the above step-ids, get the start and the end time, and from those the execution time of the step.
aws emr describe-step --cluster-id j-1WNWN0K81WR11 --step-id s-3CTY1MTJ4IPRP --query 'Step.{StartTime:Status.Timeline.StartDateTime,EndTime:Status.Timeline.EndDateTime}'
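The command returns the StartTime and the EndTime of the step. Depending on the CLI version and configuration, the timestamps come back either as epoch seconds or as ISO 8601 strings. If they are epoch seconds, the execution time can be worked out with a quick calculation as in the sketch below; the values shown are made up and have to be replaced with the ones returned by the above command.

# Hypothetical timestamps copied from the output of the describe-step command
START=1508229330.0
END=1508230930.0
# Execution time of the step in seconds
echo "$END - $START" | bc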

The above commands might look a bit cryptic, but it's easy once you get started. The documentation for the same is here. As you might have noticed, I have created an Ubuntu virtual machine on top of Windows and am executing the commands in Ubuntu.

Different ways of executing the Big Data processing jobs in EMR

There are different ways of kick starting a Hive/Pig/MR/Spark job on Amazon EMR. We already looked at how to submit a Hive job or a step from the AWS EMR management console here. This approach is nice, but doesn't leave much scope for automation.

Here are the other ways to start the Big Data Processing with some level of automation.

1) Use Apache Oozie to create a workflow and a coordinator.
2) Use the AWS CLI
3) Login to the master instance and use the Hive shell

In the above, Option 1 is a bit complicated and will be explored in another blog. Here we will be looking at the other two options.

Option 2 : Using the AWS CLI

Step 1 : Create airline.sql with the below content. It creates a table in Hive and maps it to the data in S3 (to get the data into S3, follow this article), and then runs a query on the table.
create external table ontime_parquet_snappy (
  Year INT,
  Month INT,
  DayofMonth INT,
  DayOfWeek INT,
  DepTime  INT,
  CRSDepTime INT,
  ArrTime INT,
  CRSArrTime INT,
  UniqueCarrier STRING,
  FlightNum INT,
  TailNum STRING,
  ActualElapsedTime INT,
  CRSElapsedTime INT,
  AirTime INT,
  ArrDelay INT,
  DepDelay INT,
  Origin STRING,
  Dest STRING,
  Distance INT,
  TaxiIn INT,
  TaxiOut INT,
  Cancelled INT,
  CancellationCode STRING,
  Diverted STRING,
  CarrierDelay INT,
  WeatherDelay INT,
  NASDelay INT,
  SecurityDelay INT,
  LateAircraftDelay INT
) STORED AS PARQUET
LOCATION 's3://airline-dataset/airline-parquet-snappy/'
TBLPROPERTIES ("parquet.compression"="SNAPPY");

INSERT OVERWRITE DIRECTORY 's3://airline-dataset/parquet-snappy-query-output' select Origin, count(*) from ontime_parquet_snappy where DepTime > CRSDepTime group by Origin; 
Step 2 : Put the above file into the master node using the below command.
aws emr put --cluster-id j-PQSG2Q9DS9HV --key-pair-file "/home/praveen/Documents/AWS-Keys/MyKeyPair.pem" --src "/home/praveen/Desktop/airline.sql"
Don't forget to replace the cluster-id, the path of the key-pair and the sql file in the above command.

Step 3 : Kick start the Hive program using the below command.
aws emr ssh --cluster-id j-PQSG2Q9DS9HV --key-pair-file "/home/praveen/Documents/AWS-Keys/MyKeyPair.pem" --command "hive -f airline.sql"
Replace the cluster-id and the key-pair path in the above command.

Step 4 : The final step is to monitor the progress of the Hive job and verify the output in the S3 management console.
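The output can also be checked from the CLI itself, using the output path from the query above.

aws s3 ls s3://airline-dataset/parquet-snappy-query-output/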



Option 3 : Login to the master instance and use the Hive shell

Step 1 : Delete the output of the Hive query which was created in the above option.
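Assuming the output went to the S3 path used in the query above, it can be removed from the CLI as shown below. Double-check the path before running, as the delete is recursive.

aws s3 rm s3://airline-dataset/parquet-snappy-query-output/ --recursive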

Step 2 : Follow the steps mentioned here to ssh into the master.

Step 3 : Start the Hive shell using the 'hive' command and create a table in Hive as shown below. The same create external table statement from airline.sql above can be pasted at the hive> prompt.


Step 4 : Check if the table has been created, using the show tables and the describe commands as shown below.
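Roughly, the check at the hive> prompt looks like the below (output abbreviated).

hive> show tables;
OK
ontime_parquet_snappy

hive> describe ontime_parquet_snappy;
OK
year                    int
month                   int
dayofmonth              int
...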


Step 5 : Execute the Hive query in the shell and wait for it to complete.
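The query is the same one from airline.sql above, pasted directly at the hive> prompt.

hive> INSERT OVERWRITE DIRECTORY 's3://airline-dataset/parquet-snappy-query-output'
    > select Origin, count(*) from ontime_parquet_snappy
    > where DepTime > CRSDepTime group by Origin;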



Step 6 : Verify the output of the Hive job in S3 management console.

Step 7 : Forward the local port to the remote port as mentioned here and access the YARN console, to see the status of the Hive job.
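A minimal sketch of the port forwarding, assuming the YARN ResourceManager web UI is listening on port 8088 of the master; replace the key path and the master DNS name with your own. After this, the console can be opened at http://localhost:8088 in the local browser, though some of the links on the page may need the dynamic (SOCKS) port forwarding described in the linked article.

ssh -i /home/praveen/Documents/AWS-Keys/MyKeyPair.pem -L 8088:localhost:8088 hadoop@ec2-54-147-238-2.compute-1.amazonaws.com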


This completes the steps for submitting a Hive job in different ways. The same steps can be repeated with minimal changes for Pig, Sqoop and other Big Data software as well.

EMR logging into the master instance

Once we spawn a cluster as mentioned here, we should see the instances in the EC2 management console. It would be nice to log in to the master instance. All the log files are generated on the master and then moved to S3. Also, the different Big Data processing jobs can be run from the master's command line interface.

In this blog we will look into connecting to the master. The AWS documentation for the same is here.

Step 1 : Click on the gear button at the top right. The columns on the page can be added or deleted here.


Include the EMR related keys as shown on the right of the above screenshot and the EC2 instance roles (MASTER and CORE) will be displayed as shown below.


Get the DNS hostname of the master instance after selecting it.

Step 2 : From the same EC2 management console, modify the Security Group associated with the master instance to allow inbound port 22 as shown below.
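If preferred, the same rule can be added from the CLI as in the sketch below; the security group id and the CIDR are placeholders to be replaced with the master's security group and your own IP.

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 22 --cidr 203.0.113.10/32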




Step 3 : Now ssh into the master as shown below. Note that the DNS name of the master has to be changed.
ssh -i /home/praveen/Documents/AWS-Keys/MyKeyPair.pem hadoop@ec2-54-147-238-2.compute-1.amazonaws.com

Step 4 : Go to '/mnt/var/log/' and check the different log files.
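For example, as shown below; the exact set of sub-directories depends on the applications installed on the cluster.

cd /mnt/var/log/
ls
# typically shows directories like hadoop-yarn, hive, bootstrap-actions and so on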

In the upcoming blog, we will explore running a Hive script from the master itself once we have logged into it.

Monday, October 16, 2017

How to get familiar with the different AWS Services quickly?

AWS has got a lot of services and they are introducing new services, and a lot of features within them, at a very quick pace. It's a difficult task to keep pace with them. New features (small and big) are introduced almost daily. Here is a blog to get updated on the latest services and features in AWS across the different services. In this blog, you will notice that almost every day there is something new.

The AWS documentation comes with 'Getting Started' guides/tutorials which, as the name says, get you started with the different AWS Services quickly without going into too much detail. For those who want to become an AWS Architect, an understanding of the different AWS Services is quite essential, and these 'Getting Started' guides/tutorials are helpful for the same.

The 'Getting Started' guides/tutorials for the different AWS Services follow different URL patterns, so they are difficult to find directly. A quick Google search with the below URL will turn up the AWS 'Getting Started' guides/tutorials in the AWS documentation for the different services. Click on Next in the search page to get a few more of them.


https://www.google.co.in/search?q=getting+started+site%3Aaws.amazon.com

Again, I would strongly recommend the wannabe AWS Architects to go through the above 'Getting Started' guides/tutorials. Hope it helps.

Saturday, October 14, 2017

aws-summary : Different ways of replicating data in S3

S3 has different storage classes with different durability and availability as mentioned here. S3 provides very high durability and availability, but if more is required then CRR (Cross Region Replication) can be used. CRR, as the name suggests, replicates the data automatically across buckets in two different regions to provide even more durability and availability.
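For reference, a minimal sketch of turning on CRR from the CLI, assuming versioning is already enabled on both the source and the destination buckets and a replication IAM role already exists; the bucket names, the role ARN and the file name are placeholders.

aws s3api put-bucket-replication --bucket my-source-bucket --replication-configuration file://replication.json

where replication.json contains something like:

{
  "Role": "arn:aws:iam::123456789012:role/crr-replication-role",
  "Rules": [
    {
      "Prefix": "",
      "Status": "Enabled",
      "Destination": {
        "Bucket": "arn:aws:s3:::my-destination-bucket"
      }
    }
  ]
}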


Here are a few resources around CRR.

About and using CRR

Monitoring the CRR

The below two approaches are for replicating S3 data within the same region which CRR doesn't support.

Content Replication Using AWS Lambda and Amazon S3

Synchronizing Amazon S3 Buckets Using AWS Step Functions

Nothing new here, but a way of consolidating resources around the replication of S3 data. AWS has a lot of good resources to get started and also on advanced topics, but they are dispersed and difficult to find. I will be updating this blog as I find new resources around this topic and also based on the comments on this blog.

Also, the above articles/blogs use multiple services from AWS, so working through them is a nice way to get familiar with those services.

Thursday, October 12, 2017

Is it on AWS?

I am really fascinated by the Serverless model. There is no need to think in terms of servers, no need to scale, and it's cheap when compared to server hosting, etc. I agree there are some issues, like the cold start of the containers when a Lambda, or in fact any serverless function, is invoked after a long time.

I came across this 'Is it on AWS?' a few months back and bookmarked it, but didn't get a chance to try it out. It uses a Lambda function to tell if a particular domain name or IP address is in the published list of AWS IP address ranges. The site has links to a blog and the code for the same, so I would not like to expand on it here.
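The published ranges themselves can also be explored directly from the command line; a quick sketch, assuming curl and jq are installed, which lists a few of the IP prefixes for us-east-1.

curl -s https://ip-ranges.amazonaws.com/ip-ranges.json | jq -r '.prefixes[] | select(.region=="us-east-1") | .ip_prefix' | head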

Have fun reading the blog !!!!