Wednesday, October 18, 2017

Getting notified of any state change in the AWS resources (like an EC2 instance going down)

EC2 can be used for almost any purpose, like running a website or doing some batch processing. A website typically has to be up 99.9 or 99.99 or some other percentage of the time, so backup EC2 instances are required for the sake of high availability. But let's take the case of a batch job, say transforming thousands of records from one format to another: here high availability is not really important. If an instance fails, then another one can be started, or the work can be shifted automatically to some other instance, as is usually the case with Big Data processing.

Just to quickly summarize: in the case of web servers we need some level of high availability and so multiple (backup) EC2 instances, but in the case of batch processing there is no need for a backup. So let's take the case of a batch job running on a single EC2 instance; it would be good to get some sort of notification when the instance goes down. That is what we will be looking into in this blog, using EC2, SNS and CloudWatch. So, it is also a good way to get familiar with these different services.


So, here are the steps.

Step 1: Create a new topic in the SNS management console.




Step 2: Subscribe to the topic created in the previous step.
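For those who prefer the CLI, steps 1 and 2 can also be done along the below lines (the topic name, account id and email address are just placeholders):

aws sns create-topic --name ec2-state-change
aws sns subscribe --topic-arn arn:aws:sns:us-east-1:123456789012:ec2-state-change \
    --protocol email --notification-endpoint you@example.com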




Step 3: You should get an email with a link. Click on it to confirm the subscription; the status of the subscription changes once the screen is refreshed.


Step 4: Start a Linux EC2 instance as mentioned here and note down the instance id.


Step 5: Create a rule in the CloudWatch Management Console for the appropriate instance and events.
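The equivalent from the CLI would look roughly like the below - a minimal sketch assuming the topic created in the earlier steps and a hypothetical instance id. Note that when the target is added from the CLI, the SNS topic policy also has to allow events.amazonaws.com to publish to it (the console takes care of this automatically).

aws events put-rule --name ec2-state-change \
    --event-pattern '{"source":["aws.ec2"],"detail-type":["EC2 Instance State-change Notification"],"detail":{"instance-id":["i-0123456789abcdef0"]}}'
aws events put-targets --rule ec2-state-change \
    --targets Id=1,Arn=arn:aws:sns:us-east-1:123456789012:ec2-state-change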





Step 6: Perform different actions on the EC2 instance and you should notice an email for each state change.



Note that the above procedure can be used to get a notification for any state change in the AWS resources.

Tuesday, October 17, 2017

Microsoft Azure for AWS Professionals

A few months back I blogged about 'GCP for AWS Professionals' comparing the two platforms here. Now, Microsoft has published something similar comparing Microsoft Azure with Amazon AWS here.

It's a good sign for Amazon AWS when its competitors are comparing their services with Amazon's. AWS has been adding new services, and features (small and big) within them, at a very rapid pace. Here you can get the new features introduced in Amazon AWS on a daily basis.

Similar to GCP and AWS, Azure also gives free credit to get started. So, now is the time to create an account and get started with the Cloud. Here are the links for the same (AWS, Azure and GCP).

Getting the execution times of the EMR Steps from the CLI

In the previous blog, we executed a Hive script to convert the Airline dataset from the original csv format to the Parquet Snappy format. Then the same query was run against both the csv and the Parquet Snappy data to see the performance improvements. This involved three steps.

Step 1 : Create the ontime and the ontime_parquet_snappy tables. Move the data from the ontime table to the ontime_parquet_snappy table to convert it from one format to the other.

Step 2 : Execute the query on the ontime table, which represents the csv data.

Step 3 : Execute the query on the ontime_parquet_snappy table, which represents the Parquet Snappy data.

The execution times for the above three steps were taken from the AWS EMR management console, which is a Web UI. All the tasks which can be done from the AWS management console can also be done from the CLI (Command Line Interface). Let's see the steps involved in getting the execution time of the steps in EMR.

Step 1 : Install the AWS CLI for the appropriate OS. Here are the instructions for the same.

Step 2 : Generate the Security Credentials. These are used to make calls from the SDK and CLI. More about Security Credentials here and how to generate them here.

Step 3 : Configure the AWS CLI with the Security Credentials and the Region by running the 'aws configure' command. More details here.
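For reference, a typical 'aws configure' session looks like the below (the keys shown are the example values from the AWS documentation, not real credentials):

$ aws configure
AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name [None]: us-east-1
Default output format [None]: json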

Step 4 : From the prompt execute the below command to get the cluster-id of the EMR cluster.
aws emr list-clusters --query 'Clusters[*].{Id:Id}'

Step 5 : For the above cluster-id get the step-id by executing the below command.
aws emr list-steps --cluster-id j-1WNWN0K81WR11 --query 'Steps[*].{Id:Id}'

Step 6 : For one of the above step-ids, get the start and the end time, and from them the execution time of the step.
aws emr describe-step --cluster-id j-1WNWN0K81WR11 --step-id s-3CTY1MTJ4IPRP --query 'Step.{StartTime:Status.Timeline.StartDateTime,EndTime:Status.Timeline.EndDateTime}'
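To turn the two timestamps into a duration, something along the below lines can be used. This is just a sketch: it assumes GNU date and that the CLI prints ISO 8601 timestamps; older CLI versions may print epoch seconds, in which case the two values can simply be subtracted.

START=$(aws emr describe-step --cluster-id j-1WNWN0K81WR11 --step-id s-3CTY1MTJ4IPRP \
    --query 'Step.Status.Timeline.StartDateTime' --output text)
END=$(aws emr describe-step --cluster-id j-1WNWN0K81WR11 --step-id s-3CTY1MTJ4IPRP \
    --query 'Step.Status.Timeline.EndDateTime' --output text)
echo "Step took $(( $(date -d "$END" +%s) - $(date -d "$START" +%s) )) seconds"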

The above commands might look a bit cryptic, but it's easy once you get started. The documentation for the same is here. As can be noticed, I have created an Ubuntu virtual machine on top of Windows and am executing the commands in Ubuntu.

Different ways of executing the Big Data processing jobs in EMR

There are different ways of kick starting a Hive/Pig/MR/Spark job on Amazon EMR. We already looked at how to submit a Hive job or a step from the AWS EMR management console here. This approach is cool, but doesn't leave much scope for automation.

Here are the other ways to start the Big Data Processing with some level of automation.

1) Use Apache Oozie to create a workflow and a coordinator.
2) Use the AWS CLI
3) Log in to the master instance and use the Hive shell

In the above, Option 1 is a bit complicated and will be explored in another blog. Here we will be looking at the other two options.

Option 2 : Using the AWS CLI

Step 1 : Create airline.sql with the below content. It will create a table in Hive and map it to the data in S3 (to get the data into S3, follow this article), and then run a query on the table.
create external table ontime_parquet_snappy (
  Year INT,
  Month INT,
  DayofMonth INT,
  DayOfWeek INT,
  DepTime  INT,
  CRSDepTime INT,
  ArrTime INT,
  CRSArrTime INT,
  UniqueCarrier STRING,
  FlightNum INT,
  TailNum STRING,
  ActualElapsedTime INT,
  CRSElapsedTime INT,
  AirTime INT,
  ArrDelay INT,
  DepDelay INT,
  Origin STRING,
  Dest STRING,
  Distance INT,
  TaxiIn INT,
  TaxiOut INT,
  Cancelled INT,
  CancellationCode STRING,
  Diverted STRING,
  CarrierDelay INT,
  WeatherDelay INT,
  NASDelay INT,
  SecurityDelay INT,
  LateAircraftDelay INT
) STORED AS PARQUET LOCATION 's3://airline-dataset/airline-parquet-snappy/' TBLPROPERTIES ("parquet.compression"="SNAPPY");

INSERT OVERWRITE DIRECTORY 's3://airline-dataset/parquet-snappy-query-output' select Origin, count(*) from ontime_parquet_snappy where DepTime > CRSDepTime group by Origin; 
Step 2 : Put the above file onto the master node using the below command.
aws emr put --cluster-id j-PQSG2Q9DS9HV --key-pair-file "/home/praveen/Documents/AWS-Keys/MyKeyPair.pem" --src "/home/praveen/Desktop/airline.sql"
Don't forget to replace the cluster-id, the path of the key-pair and the sql file in the above command.

Step 3 : Kick start the Hive program using the below command.
aws emr ssh --cluster-id j-PQSG2Q9DS9HV --key-pair-file "/home/praveen/Documents/AWS-Keys/MyKeyPair.pem" --command "hive -f airline.sql"
Replace the cluster-id and the key-pair path in the above command.

Step 4 : The final step is to monitor the progress of the Hive job and verify the output in the S3 management console.



Option 3 : Login to the master instance and use the Hive shell

Step 1 : Delete the output of the Hive query which was created in the above option.

Step 2 : Follow the steps mentioned here to ssh into the master.

Step 3 : Start the Hive shell using the 'hive' command and create a table in Hive as shown below.


Step 4 : Check whether the table has been created, as shown below, using the show and the describe SQL commands.
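A rough sketch of what the shell session looks like (the DDL is the same CREATE EXTERNAL TABLE statement used in airline.sql above):

$ hive
hive> -- paste the CREATE EXTERNAL TABLE ontime_parquet_snappy statement from airline.sql here
hive> show tables;
hive> describe ontime_parquet_snappy;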


Step 5 : Execute the Hive query in the shell and wait for it to complete.
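The query is the same one as at the end of airline.sql above:

hive> INSERT OVERWRITE DIRECTORY 's3://airline-dataset/parquet-snappy-query-output' select Origin, count(*) from ontime_parquet_snappy where DepTime > CRSDepTime group by Origin;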



Step 6 : Verify the output of the Hive job in S3 management console.

Step 7 : Forward the local port to the remote port as mentioned here and access the YARN console, to see the status of the Hive job.
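A rough sketch of the port forwarding, assuming the YARN ResourceManager is on its default port 8088 and picking 8157 as an arbitrary local port (replace the key and the DNS name with your own):

ssh -i /home/praveen/Documents/AWS-Keys/MyKeyPair.pem -N \
    -L 8157:ec2-54-147-238-2.compute-1.amazonaws.com:8088 hadoop@ec2-54-147-238-2.compute-1.amazonaws.com

The YARN console should then be accessible at http://localhost:8157 in the browser.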


This completes the steps for submitting a Hive job in different ways. The same steps can be repeated with minimal changes for Pig, Sqoop and the other Big Data tools also.

EMR - Logging into the master instance

Once we spawn a cluster as mentioned here, we should see the instances in the EC2 management console. It would be nice to log in to the master instance: all the log files are generated on the master and then moved to S3, and the different Big Data processing jobs can be run from the master's command line.

In this blog we will look into connecting to the master. The AWS documentation for the same is here.

Step 1 : Click on the gear button on the top right. The columns in the page can be added or deleted here.


Include the EMR related keys as shown on the right of the above screenshot and the EC2 instance roles (MASTER and CORE) will be displayed as shown below.


Get the DNS hostname of the master instance after selecting it.

Step 2 : From the same EC2 management console, modify the Security Group associated with the master instance to allow inbound port 22 as shown below.
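The same can be done from the CLI along the below lines (the security group id is a placeholder, and the CIDR should be restricted to your own IP address):

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 22 --cidr 203.0.113.10/32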




Step 3 : Now ssh into the master as shown below. Note that the DNS name of the master has to be changed.
ssh -i /home/praveen/Documents/AWS-Keys/MyKeyPair.pem hadoop@ec2-54-147-238-2.compute-1.amazonaws.com

Step 4 : Go to '/mnt/var/log/' and check the different log files.

In the upcoming blog, we will explore running a Hive script from the master itself once we have logged into it.

Monday, October 16, 2017

How to get familiar with the different AWS Services quickly?

AWS has got a lot of services and they are introducing new services, and a lot of features within them, at a very quick pace. It's a difficult task to keep pace with them. New features (small and big) are introduced almost daily. Here is a blog to get updated on the latest services and features in AWS; in it, you will notice that almost every day there is something new.

AWS documentation comes with 'Getting Started' guides/tutorials which, as the name says, help you get started with the different AWS Services quickly without going into too much detail. For those who want to become an AWS Architect, an understanding of the different AWS Services is quite essential, and these 'Getting Started' guides/tutorials are helpful for the same.

The 'Getting Started' guides/tutorials for the different AWS Services follow different URL patterns and so are difficult to track down. A quick Google search with the below URL will find all the AWS 'Getting Started' guides/tutorials in the AWS documentation for the different services. Click on Next in the search page to get a few more of them.


https://www.google.co.in/search?q=getting+started+site%3Aaws.amazon.com

Again, I would strongly recommend that aspiring AWS Architects go through the above 'Getting Started' guides/tutorials. Hope it helps.

Saturday, October 14, 2017

aws-summary : Different ways of replicating data in S3

S3 has different storage classes with different durability and availability, as mentioned here. S3 provides very high durability and availability, but if more is required then CRR (Cross Region Replication) can be used. CRR, as the name suggests, automatically replicates the data across buckets in two different regions to provide even more durability and availability.


Here are a few resources around CRR.

About and using CRR

Monitoring the CRR
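A minimal sketch of enabling CRR from the CLI is shown below. The bucket names, account id and role ARN are placeholders; both buckets need versioning enabled, and the IAM role must allow S3 to replicate on your behalf (the linked articles above cover the details).

aws s3api put-bucket-versioning --bucket source-bucket --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket destination-bucket --versioning-configuration Status=Enabled
aws s3api put-bucket-replication --bucket source-bucket --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/crr-role",
    "Rules": [{"Prefix": "", "Status": "Enabled",
               "Destination": {"Bucket": "arn:aws:s3:::destination-bucket"}}]
}'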

The below two approaches are for replicating S3 data within the same region, which CRR doesn't support.

Content Replication Using AWS Lambda and Amazon S3

Synchronizing Amazon S3 Buckets Using AWS Step Functions

Nothing new here, just a way of consolidating resources around the replication of S3 data. AWS has a lot of good resources, both for getting started and on advanced topics, but they are dispersed and difficult to find. I will be updating this blog as I find new resources around this topic and also based on the comments on this blog.

Also, the above articles/blogs use multiple services from AWS, so it would be a nice way to get familiar with them.

Thursday, October 12, 2017

Is it on AWS?

I am really fascinated by the Serverless model. There is no need to think in terms of servers, no need to scale, and it's cheap when compared to server hosting. I agree there are some issues, like the cold start of the containers when a Lambda (or in fact any serverless) function is invoked after a long time.

I came across this 'Is it on AWS?' site a few months back and bookmarked it, but didn't get a chance to try it out. It uses a Lambda function to tell if a particular domain name or IP address is in the published list of AWS IP address ranges. The site has links to a blog and the code for the same, so I would not like to expand on it here.

Have fun reading the blog !!!!

Creating an Application Load Balancer and querying its logging data using Athena

When building a highly scalable website like amazon.com, there would be thousands of web servers, all of them fronted by multiple load balancers as shown below. The end user points to the load balancer, which forwards the requests to the web servers. In the case of the AWS ELB (Elastic Load Balancer), the distribution of the traffic from the load balancer to the servers is done in a round-robin fashion and doesn't consider the size of the servers or how busy/idle they are. Maybe AWS will add this feature in an upcoming release.


In this blog, we would be analyzing the number of users coming to a website from different IP addresses. Here are the steps at a high level, which we would be exploring in a bit more detail. This is again a lengthy post where we would be using a few AWS services (ELB, EC2, S3 and Athena) and seeing how they work together.
   
    - Create two Linux EC2 instances with web servers serving different content
    - Create an Application Load Balancer and forward the requests to the above web servers
    - Enable the logging on the Application Load Balancer to S3
    - Analyze the logging data using Athena

To continue further, the following can be done (not covered in this article)

    - Create a Lambda function to call the Athena query at regular intervals
    - Auto Scale the EC2 instances depending on the resource utilization
    - Remove the Load Balancer data from s3 after a certain duration

Step 1: Create two Linux instances and install web servers on them as mentioned in this blog. In the /var/www/html folder have the files as mentioned below. Ports 22 and 80 have to be opened for accessing the instances through ssh and for accessing the web pages in the browser.

     server1 - index.html
     server2 - index.html and img/someimage.png

Make sure that ip-server1, ip-server2 and ip-server2/img/someimage.png are accessible from the web browser. Note that the image should be present in the img folder. The index.html is for serving the web pages and also for the health check, while the image is there so that the path-based routing of '/img/*' requests can be demonstrated.
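On an Amazon Linux instance, the web server setup is roughly the below (a sketch; the package name and service commands may differ on other distributions, and the content of index.html is just an example):

sudo yum install -y httpd
sudo service httpd start
echo "response from server1" | sudo tee /var/www/html/index.html
# on server2 only: create the img folder and copy someimage.png into it
sudo mkdir -p /var/www/html/img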



Step 2: Create the Target Group.



Step 3: Attach the EC2 instances to the Target Group.




Step 4: Change the Target Group's health check settings. This will make the instances turn healthy faster.


Step 5: Create the second Target Group. Associate server2 with the target-group2 as mentioned in the flow diagram.


Step 6: Now is the time to create the Application Load Balancer. This balancer is relatively new when compared to the Classic Load Balancer; here is the difference between the different Load Balancers. The Application Load Balancer operates at layer 7 of the OSI model and supports host-based and path-based routing. Any web requests with the '/img/*' pattern would be sent to target-group2; the rest by default would be sent to target-group1 after completing the below settings.







Step 7: Associate the target-group1 with the Load Balancer, the target-group2 will be associated later.





Step 8: Enable access logs on the Load Balancer by editing the attributes. The specified S3 bucket for storing the logs will be automatically created.
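The same attributes can also be set from the CLI along the below lines (the load balancer ARN and the bucket name are placeholders; note that, unlike the console, the CLI will not create the bucket or its policy for you):

aws elbv2 modify-load-balancer-attributes --load-balancer-arn <load-balancer-arn> \
    --attributes Key=access_logs.s3.enabled,Value=true Key=access_logs.s3.bucket,Value=my-alb-access-logs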



A few minutes after the Load Balancer has been created, the instances should turn into a healthy state as shown below. If not, then maybe one of the above steps has been missed.


Step 9: Get the DNS name of the Load Balancer and open it in the browser to make sure that the Load Balancer is working.
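A quick check from the command line (replace the placeholder with the DNS name of the Load Balancer):

curl http://<load-balancer-dns-name>/
curl -I http://<load-balancer-dns-name>/img/someimage.png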


Step 10: Now is the time to associate the second Target Group (target-group2). Click on 'View/edit rules' and add a rule.





Any requests with the path /img/* would be sent to target-group2; the rest of them would be sent to target-group1.



Step 11: Once the Load Balancer has been accessed from different browsers a couple of times, the log files should be generated in S3 as shown below.


Step 12: Now it's time to create a table in Athena, map it to the data in S3 and query it. The DDL and the DML commands for Athena can be found here.
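Once the table is in place, the per-IP counts can also be fetched from the CLI, roughly as below. This assumes a table named alb_logs with a client_ip column, created from the DDL in the linked documentation, and a placeholder S3 location for the query results:

aws athena start-query-execution \
    --query-string "SELECT client_ip, count(*) AS requests FROM alb_logs GROUP BY client_ip ORDER BY requests DESC" \
    --result-configuration OutputLocation=s3://my-athena-query-results/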




We have seen how to create a Load Balancer, associate Linux web servers with it and finally query the log data with Athena. Make sure that all the AWS resources which have been created are deleted to stop the billing for them.

That's it for now.