Tuesday, June 27, 2017

Processing the Airline dataset with AWS Athena

AWS Athena is an interactive query service for processing data in S3. Athena is based on Presto, which was developed by Facebook and later open sourced. With Athena there is no need to start a cluster or spawn EC2 instances. Simply create a table, point it to the data in S3 and run the queries.

In the previous blog, we looked at converting the Airline dataset from the original csv format to a columnar format and then running SQL queries on the two data sets using the Hive/EMR combination. In this blog we will process the same data sets using Athena. So, here are the steps.

Step 1 : Go to the Athena Query Editor and create the ontime and the ontime_parquet_snappy tables as shown below. The DDL queries for creating these two tables can be found in this blog.

Step 2 : Run the query on the ontime and the ontime_parquet_snappy tables as shown below. Again, the queries can be found in the blog mentioned in Step 1.
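
For those who prefer the command line, the same query can also be submitted through the AWS CLI instead of the Query Editor. Below is a rough sketch; the results bucket/prefix and the database name are placeholders and need to be replaced with your own.

# Submit the query to Athena (the output location below is just a placeholder)
aws athena start-query-execution \
    --query-string "select Origin, count(*) from ontime_parquet_snappy where DepTime > CRSDepTime group by Origin" \
    --query-execution-context Database=default \
    --result-configuration OutputLocation=s3://airline-dataset/athena-query-results/

# Check the status and the statistics (run time, data scanned) using the returned QueryExecutionId
aws athena get-query-execution --query-execution-id <QueryExecutionId>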

Note that processing the csv data took 3.56 seconds and scanned 2.14 GB of S3 data, while processing the Parquet Snappy data took 3.07 seconds and scanned only 46.21 MB.

There is not a significant time difference between running the queries on the two datasets. But Athena pricing is based on the amount of data scanned in S3, so the cost of processing the Parquet Snappy data is significantly lower than that of the csv data.

Step 3 : Go to the Catalog Manager and drop the tables. Dropping them simply deletes the table definitions, not the associated data in S3.

Just out of curiosity, I created the two tables again and ran a different query this time. Below are the queries with the metrics.
select distinct(origin) from ontime_parquet_snappy;
Run time: 2.33 seconds, Data scanned: 4.76MB

select distinct(origin) from ontime;
Run time: 1.93 seconds, Data scanned: 2.14GB

As before, there is not much difference in the time taken for query execution, but the amount of S3 data scanned for the Parquet Snappy data is significantly lower. So, the cost of running the query on the Parquet Snappy data is significantly less.

Friday, June 23, 2017

Algorithmia - a store for algorithms, models and functions

I came across Algorithmia a few months back and didn't get a chance to try it out. It came into focus again with a Series A funding of $10.5M. More about the funding here.

Algorithmia is a place where algorithms, models or functions can be discovered and used for credits which we can buy. We get 5,000 credits every month for free. For example, if a model costs 20 credits per call, then it can be called 250 times a month.

Create a free account here and get the API key from the profile. Now we should be able to call the different models using languages like Python, Java and R, or commands like curl. Below are the curl commands to do sentiment analysis on a sentence. Make sure to replace the API_KEY with your own key.

curl -X POST -d '{"sentence": "I really like this website called algorithmia"}' -H 'Content-Type: application/json' -H 'Authorization: Simple API_KEY' https://api.algorithmia.com/v1/algo/nlp/SocialSentimentAnalysis

{"result":[{"compound":0.4201,"negative":0,"neutral":0.642,"positive":0.358,"sentence":"I really like this website called algorithmia"}],"metadata":{"content_type":"json","duration":0.010212005}}

curl -X POST -d '{"sentence": "I really dont like this website called algorithmia"}' -H 'Content-Type: application/json' -H 'Authorization: Simple API_KEY' https://api.algorithmia.com/v1/algo/nlp/SocialSentimentAnalysis

{"result":[{"compound":-0.3374,"negative":0.285,"neutral":0.715,"positive":0,"sentence":"I really dont like this website called algorithmia"}],"metadata":{"content_type":"json","duration":0.009965723}}
Algorithmia is more like the Google Play Store or the Apple App Store, where individuals and companies can upload mobile applications and the rest of us can download them. It's an attempt to democratize Artificial Intelligence and Machine Learning.

Here is a service to convert black and white images to color.

Monday, June 19, 2017

Converting Airline dataset from the row format to columnar format using AWS EMR

To process Big Data, a huge number of machines is required. Instead of buying them, it's better to process the data in the Cloud as it provides lower CAPEX and OPEX costs. In this blog we will look at processing the airline data set in AWS EMR (Elastic MapReduce). EMR provides Big Data as a service. We don't need to worry about installing, configuring, patching or securing the Big Data software; EMR takes care of that. We just need to specify the size and the number of machines in the cluster, the location of the input/output data and finally the program to run. It's as easy as that.

The Airline dataset is in a csv format, which is efficient for fetching the data row-wise based on some condition, but not really efficient when we want to do aggregations. So, we will convert the csv data into the Parquet format and then run the same queries on the csv and Parquet data sets to observe the performance improvements.

Note that using AWS EMR will incur cost and doesn't fall under the AWS free tier, as we will be launching not t2.micro EC2 instances but somewhat bigger ones. I will try to keep the cost to a minimum as this is a demo. Also, I prepared the required scripts ahead of time and tested them on my local machine on small data sets instead of on AWS EMR. This saves on the AWS expenses.

So, here are the steps

Step 1 : Download the Airline data set from here and uncompress it. All the data sets can be downloaded and uncompressed, but to keep the cost to a minimum I downloaded the 1987, 1989, 1991, 1993 and 2007 data and uploaded it to S3 as shown below.

Step 2 : Create a folder called scripts and upload the below scripts to S3.

The '1-create-tables-move-data.sql' script will create the ontime and the ontime_parquet_snappy tables, map the data to the tables and finally move the data from the ontime table to the ontime_parquet_snappy table, transforming it from the csv to the Parquet format along the way. Below is the SQL for the same.
create external table ontime (
  Year INT,
  Month INT,
  DayofMonth INT,
  DayOfWeek INT,
  DepTime  INT,
  CRSDepTime INT,
  ArrTime INT,
  CRSArrTime INT,
  UniqueCarrier STRING,
  FlightNum INT,
  TailNum STRING,
  ActualElapsedTime INT,
  CRSElapsedTime INT,
  AirTime INT,
  ArrDelay INT,
  DepDelay INT,
  Origin STRING,
  Dest STRING,
  Distance INT,
  TaxiIn INT,
  TaxiOut INT,
  Cancelled INT,
  CancellationCode STRING,
  Diverted STRING,
  CarrierDelay INT,
  WeatherDelay INT,
  NASDelay INT,
  SecurityDelay INT,
  LateAircraftDelay INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://airline-dataset/airline-csv/';

create external table ontime_parquet_snappy (
  Year INT,
  Month INT,
  DayofMonth INT,
  DayOfWeek INT,
  DepTime  INT,
  CRSDepTime INT,
  ArrTime INT,
  CRSArrTime INT,
  UniqueCarrier STRING,
  FlightNum INT,
  TailNum STRING,
  ActualElapsedTime INT,
  CRSElapsedTime INT,
  AirTime INT,
  ArrDelay INT,
  DepDelay INT,
  Origin STRING,
  Dest STRING,
  Distance INT,
  TaxiIn INT,
  TaxiOut INT,
  Cancelled INT,
  CancellationCode STRING,
  Diverted STRING,
  CarrierDelay INT,
  WeatherDelay INT,
  NASDelay INT,
  SecurityDelay INT,
  LateAircraftDelay INT
) STORED AS PARQUET LOCATION 's3://airline-dataset/airline-parquet-snappy/' TBLPROPERTIES ("parquet.compression"="SNAPPY");

INSERT OVERWRITE TABLE ontime_parquet_snappy SELECT * FROM ontime;
The '2-run-queries-csv.sql' script will run the query on the ontime table which maps to the csv data. Below is the query.
INSERT OVERWRITE DIRECTORY 's3://airline-dataset/csv-query-output' select Origin, count(*) from ontime where DepTime > CRSDepTime group by Origin;
The '3-run-queries-parquet.sql' script will run the query on the ontime_parquet_snappy table which maps to the Parquet-Snappy data. Below is the query.
INSERT OVERWRITE DIRECTORY 's3://airline-dataset/parquet-snappy-query-output' select Origin, count(*) from ontime_parquet_snappy where DepTime > CRSDepTime group by Origin;
Step 3 : Go to the EMR management console and click on 'Go to advanced options'.

Step 4 : Here, select the software to be installed on the instances. For this blog we need Hadoop 2.7.3 and Hive 2.1.1. Make sure these are selected; the rest are optional. Here we can also add a step. According to the AWS documentation, the definition of a Step is: 'Each step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster.' This can be an MR program, a Hive query, a Pig script or something else. The steps can be added here or later; we will add them later. Click on Next.

Step 5 : In this step, we can select the number of instances we want to run and the size of each instance. We will leave them at the defaults and click on Next.

Step 6 : In this step, we can configure additional settings like the cluster name, the S3 log path location and so on. Make sure the 'S3 folder' points to the log folder in S3, uncheck the 'Termination protection' option and click on Next.

Step 7 : On this screen, the default options are again good enough. If we want to ssh into the EC2 instances, then an 'EC2 key pair' has to be selected. Here are the instructions on how to create a key pair. Finally, click on 'Create cluster' to launch the cluster.
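
The same cluster can also be launched from the AWS CLI instead of the console. Below is a minimal sketch; the release label, instance type and count, key pair name and log bucket are assumptions and should be adjusted to match your setup.

# Launch an EMR cluster with Hadoop and Hive installed (values below are placeholders)
aws emr create-cluster \
    --name "airline-dataset-conversion" \
    --release-label emr-5.5.0 \
    --applications Name=Hadoop Name=Hive \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --ec2-attributes KeyName=my-key-pair \
    --log-uri s3://airline-dataset/logs/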

Initially the cluster will be in a Starting state and the EC2 instances will be launched as shown below.

Within a few minutes, the cluster will be in a running state and the Steps (data processing programs) can be added to the cluster.

Step 8 : Add a Step to the cluster by clicking on 'Add step', pointing to the '1-create-tables-move-data.sql' file as shown below and clicking on Add. The processing will start on the cluster.
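
Adding the same step from the AWS CLI would look roughly like the sketch below, assuming the cluster id returned by create-cluster and the script location in S3 used earlier in this blog.

# Add a Hive step that runs the table creation / conversion script (the cluster id is a placeholder)
aws emr add-steps \
    --cluster-id j-XXXXXXXXXXXXX \
    --steps Type=HIVE,Name=Create-tables-move-data,ActionOnFailure=CONTINUE,Args=[-f,s3://airline-dataset/scripts/1-create-tables-move-data.sql]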

The Step will be in a Pending status for some time and then move to the Completed status after the processing has been done.

Once the processing is complete, the csv data will have been converted into the Parquet format with Snappy compression and put into S3 as shown below.

Note that the csv data was close to 2,192 MB while the Parquet Snappy data is around 190 MB. Parquet stores the data in a columnar format and provides higher compression compared to csv. This allows more data to fit into memory and so gives quicker results.
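
The sizes of the two data sets can be verified from the AWS CLI as shown below, assuming the same bucket layout used in this blog.

# Compare the total size of the csv and the Parquet Snappy data in S3
aws s3 ls s3://airline-dataset/airline-csv/ --recursive --human-readable --summarize
aws s3 ls s3://airline-dataset/airline-parquet-snappy/ --recursive --human-readable --summarize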

Step 9 : Now add 2 more steps using '2-run-queries-csv.sql' and '3-run-queries-parquet.sql'. The first sql file will run the query on the csv table and the second will run the query on the Parquet Snappy table. Both queries are the same and write identical results to S3.

Step 10 : Check the step log files for the steps to get the execution times in the EMR management console.

Converting the CSV to Parquet Snappy format - 148 seconds
Executing the query on the csv data - 90 seconds
Executing the query on the Parquet Snappy data - 56 seconds

Note that the query runs faster on the Parquet Snappy data when compared to the csv data. I was expecting the query to run even faster though; I need to look into this a bit more.

Step 11 : Now that the processing has been done, it's time to terminate the cluster. Click on Terminate and again on Terminate. It will take a few minutes for the cluster to terminate.
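
The cluster can also be terminated from the AWS CLI, assuming the cluster id noted earlier.

# Terminate the EMR cluster (the cluster id is a placeholder)
aws emr terminate-clusters --cluster-ids j-XXXXXXXXXXXXX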

Note that when the EMR cluster is terminated, the associated EC2 instances will also terminate.

Step 12 : Go back to the S3 management console and the below folders should be there. Clean up by deleting the bucket. I will be keeping the data so that I can try Athena and Redshift on the csv and the Parquet Snappy data. Note that 5 GB of S3 data can be stored for free for up to one year. More details about the AWS free tier here.

In future blogs, we will look at processing the same data using AWS Athena. With Athena there is no need to spawn a cluster, hence the serverless model; AWS Athena automatically provisions the servers. We simply create a table, map it to the data in S3 and run SQL queries against it.

With EMR, the pricing is rounded up to the hour, so for a query that takes about 1 hour and 5 minutes we need to pay for the complete 2 hours. With Athena we pay by the amount of data scanned. So, changing the data into a columnar format and compressing it will not only make the query run a bit faster, but also cut down the bill.

Update : Here and here are articles from the AWS documentation on the same. They include some additional commands.

Friday, June 16, 2017

GCP for AWS Professionals

As many of you know, I have been working with AWS (the public Cloud from Amazon) for quite some time, so I thought of getting my feet wet with GCP (the public Cloud from Google). I tried to find some free (????) MOOCs around GCP and didn't find many.

Google partnered with Coursera and started a MOOC on the GCP fundamentals. It covers the different GCP fundamentals at a very high level with a few demos. Here is the link for the same. There is also documentation from Google comparing the GCP and the AWS platform here.

As I was going through the above mentioned MOOC, I could clearly map many of the GCP services to the AWS services which I am more comfortable with. Here is the mapping between the different GCP and AWS services. The mappings are really helpful for those who are comfortable with one of the Cloud platforms and want to get familiar with the other.

AWS provides free resources for developers to get familiar and to get started with their platform. The same is the case with GCP: a $300 credit, valid for 1 year, is provided. Both Cloud vendors provide a few services for free for the first year, and there are some services which are free for life with some usage limitations.

Maybe it's my perspective, but the content around AWS seems to be much more robust and organized when compared to the GCP documentation. The same is the case with the Web UI. Anyway, here is the documentation for AWS and here is the documentation for GCP.

Happy clouding.

AWS Lambda with the Serverless framework

Google Cloud Functions, IBM OpenWhisk, Azure Functions and AWS Lambda allow building applications in a Serverless fashion. Serverless doesn't mean that there is no server; rather, the Cloud vendor takes care of provisioning the servers and scaling them up and down as required. In the Serverless model, all we have to do is author a function, specify the event that triggers it and the resources the function uses.

Since no servers are being provisioned, there is no constant cost. The cost is directly proportional to the number of times the particular function is called and the amount of resources it consumes. This model is useful for a startup with limited resources or for a big company that wants to deploy applications quickly.

Here is a nice video from the founder of aCloudGuru on how they built their entire business on a Serverless Architecture. Here is a nice article/tutorial from AWS on using a Lambda function to shrink images uploaded to an S3 bucket.

In the above workflow, as soon as an image is uploaded to the S3 Source Bucket, it fires an event and calls the Lambda function. The function shrinks the image and then puts it in the S3 Target Bucket. In this scenario, there is no sustained cost, as we pay based on the number of times the function is called and the amount of resources it consumes.

For the last few days I was intrigued by the Serverless architecture and had been trying the use cases mentioned in the AWS Lambda documentation here. It was fun, but not straightforward or as simple as just uploading a function. The function has to be authored, packaged and uploaded, permissions have to be set for the events/resources, the Lambda has to be tested and finally integrated with the API Gateway. Oops. It's not an impossible task, but it is definitely tedious. And that's without mentioning the debugging of Lambda functions, which is really a pain.

To the rescue comes the Serverless framework, which makes working with Lambda functions easy. Setting up the Serverless framework on Ubuntu was a breeze. Creating a HelloWorld Lambda function in AWS with all the required dependencies was even easier.

Note that the Serverless framework supports other platforms besides AWS, but in this blog I will provide some screenshots with a brief write-up on AWS. Here, I go with the assumption that the Serverless framework and all its dependencies, like NodeJS and the integration with the AWS Security Credentials, have already been set up.

So, here are the steps.

Step 1 : Create the skeleton code and the Serverless Lambda configuration file.
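
As a rough sketch, the skeleton for a Python function can be generated as shown below; the service path hello-world is just a placeholder. This creates a handler.py and a serverless.yml in the new folder.

# Generate the skeleton (handler.py and serverless.yml) for a Python Lambda function
serverless create --template aws-python --path hello-world
cd hello-world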

Step 2 : Deploy the code to AWS using CloudFormation.
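
From the service folder, the deployment is a single command; behind the scenes the framework packages the function and creates the CloudFormation stack.

# Package the function and create/update the CloudFormation stack in AWS
serverless deploy --verbose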

Once the deployment is complete, the AWS resources will be created as shown below in the CloudFormation console.

As seen in the below screen, the Lambda function also gets deployed.

Step 3 : Now is the time to invoke the Lambda function, again using the Serverless framework, as shown below.
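
The invocation below assumes the function is named hello, which is what the aws-python template generates in serverless.yml.

# Invoke the deployed Lambda function and print its response
serverless invoke --function hello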

In the Lambda management console, we can observe that the Lambda function was called once.

Step 4 : Undeploy the Lambda function using Serverless framework.
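
Undeploying is again a single command from the service folder.

# Delete the CloudFormation stack and all the resources created by the deploy
serverless remove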

The CloudFormation stack should be removed and all the associated resources should be deleted.

This completes the steps for a simple demo of the Serverless framework. A couple of things to note: we simply uploaded a Python function to the AWS Cloud and never created a server, hence the Serverless Architecture. Here are a few more resources to get started with the Serverless Architecture.

On a side note, I was really surprised that the Serverless framework was started by Coca-Cola and that at some point they decided to open source it on GitHub. Until now it was companies like Facebook, Google, Twitter, LinkedIn, Netflix and Microsoft :) opening their internal software to the public, but this was the first time I saw something from Coca-Cola.

Maybe I should drink more Coca-Cola and encourage them to publish more such cool frameworks.

Wednesday, May 31, 2017

New AWS Certifications on Big Data and Advanced Networking

AWS announced the AWS Certified Big Data – Specialty and the AWS Certified Advanced Networking - Specialty certifications. These certifications had been in Beta and were opened to the public today. The exam guide and the sample questions can be downloaded from the certification home pages mentioned earlier. One of the associate level certifications (1, 2, 3) has to be cleared before taking these certifications.

There are two online courses to help get through the AWS Big Data certification (here and here). I am not exactly sure how much these courses cover the various topics of the Big Data certification, but it never hurts to take one of them.

I am comfortable with Big Data and the AWS Cloud, so I am planning to attempt the `AWS Certified Big Data – Specialty` exam ASAP. I will also post an update with tips and my experience with the certification. I am really excited about this certification.

Best of luck !!!

Friday, May 19, 2017

AWS and Big Data Training

I have been in the training line (along with consulting and projects) for a few years, both online and classroom, using a Wacom Tablet which I bought 5 years back. The Tablet helps me convey the message better and quicker, whether the participant is in some other part of the world or right in front of me. I am really passionate about the trainings, as they help me think about a particular concept from different perspectives.

So, here is an AWS demo on how to create an ELB (Elastic Load Balancer) and EC2 instances (Linux servers) in the Cloud. This particular combination can be used for High Availability or for Scaling Up as more and more users start using your application.

Similarly, below is a Big Data demo on creating an HBase Observer, which is very similar to a Trigger in an RDBMS. BTW, for those who are new to HBase, it's a columnar NoSQL database. There are a lot of NoSQL databases and HBase is one of them.

If you are interested in any of these trainings, please contact me at praveensripati@gmail.com for further details. Both Big Data and the Cloud are all the rage, so don't miss the opportunities around them.

See you soon !!!

Wednesday, May 3, 2017

Attaching an EBS Disk to a Linux Instance

In the previous blog, we looked at the sequence of steps to create a Linux instance and log into it. In this blog, we will create a new hard disk (actually an EBS volume) and attach it to the Linux instance. I am going with the assumption that the EC2 instance has already been created.

1. Go to the EC2 management console and click on Volumes in the left pane. Then click on `Create Volume`, change the volume size to 1 GB and click on Create.
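
For reference, the same volume can be created from the AWS CLI as sketched below; the availability zone is a placeholder and must match the zone of the EC2 instance.

# Create a 1 GB EBS volume in the same availability zone as the EC2 instance (zone below is a placeholder)
aws ec2 create-volume --size 1 --availability-zone us-east-1a --volume-type gp2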

2. It takes a couple of seconds, but then there will be two EBS volumes: an 8 GB volume (in-use) which was automatically created at the time of EC2 creation, and a 1 GB volume (available) which was created in the above step.

3. Select the 1 GB volume, click on Actions and then Attach Volume. We have created a Linux EC2 instance and an EBS volume; now we need to attach the two.

4. Move the cursor to the instance box, select the EC2 instance to which the EBS volume has to be attached and then click Attach. The state of the 1 GB volume should change from available to in-use.
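
The equivalent AWS CLI command is sketched below, with placeholder volume and instance ids.

# Attach the 1 GB volume to the instance as /dev/xvdf (the ids below are placeholders)
aws ec2 attach-volume \
    --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 \
    --device /dev/xvdf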

5. Get the ip address of the EC2 instance and login to it using Putty as mentioned here.

6. Change to root (sudo su) and get the list of partition tables (fdisk -l). The commands are mentioned in the parentheses. Note that the device name is /dev/xvdf in the output below. It may vary, so the previously mentioned command has to be run.

7. Build the Linux filesystem (mkfs /dev/xvdf), create a folder (mkdir /mnt/disk100) and finally mount the filesystem (mount /dev/xvdf /mnt/disk100). The commands are mentioned in the parentheses. Note that I have chosen to create the disk100 folder; you can replace it with any other folder name.
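
Putting the commands from steps 6 and 7 together, the sequence on the Linux instance looks like this.

sudo su                       # switch to root
fdisk -l                      # confirm the new device name (here assumed to be /dev/xvdf)
mkfs /dev/xvdf                # build the Linux filesystem on the new volume
mkdir /mnt/disk100            # create the mount point
mount /dev/xvdf /mnt/disk100  # mount the filesystem
df -h                         # verify that the volume is mounted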

Now that the device has been mounted on the /mnt/disk100 folder, the data in this folder will be written to the 1 GB EBS volume which we created in one of the previous steps. Even after stopping the EC2 instance, the data will still be there on the EBS volume. This EBS volume can also be attached to another EC2 instance.

Note that in the case of AWS, an EBS volume cannot be attached to multiple EC2 instances. But, the same thing can be done in the Google Cloud Platform.

Don't forget to terminate the Linux EC2 instance and delete the EBS volume. In the next blog, we will attach an EBS volume to a Windows EC2 instance, which is a bit easier.

Wednesday, April 26, 2017

Scratch programming for kids

It's summer holidays and I have been helping my kid (8 years old) get started with programming. So I bought the Computer Coding for Kids book by Carol Vorderman. The book covers Scratch (from MIT) first and then Python from the basics. Scratch is a visual programming language which is easy for those who are just getting started with programming. There is no need to type code; it's more drag and drop.

So, here is his first program (link here), a small and cute game in Scratch. Click on the flag at the top right to get the game started. He was all excited to get it published on this blog. One thing to note is that running the game requires Adobe Flash, which is slowly going out of fashion with the different browsers because of its vulnerability and stability issues. For some reason, I was able to get it to run only in IE and not in the Edge/Chrome/Firefox browsers.

Scratch looks a bit basic, but I have seen some interesting programs developed in it. I would recommend the above mentioned book for anyone who wants to get started with programming. Carol has written a few more books, which I plan to buy once we complete this book.

Monday, April 24, 2017

Creating a Linux AMI from an EC2 instance

In one of the earlier blogs, we created a Linux EC2 instance with a static page. We installed and started the httpd server and created a simple index.html in the /var/www/html folder. The application on the EC2 instance is very basic, but we can definitely build complex applications the same way.

Let's say we need 10 such instances. Then it's not necessary to install the software and do the necessary configuration multiple times. What we can do is create an AMI once the software has been installed and the necessary configuration changes have been made. The AMI will have the OS and the required software with the appropriate configurations. The same AMI can then be used while launching new EC2 instances. More about AMIs here.

So, here are the steps

1. Create a Linux EC2 instance with the httpd server and index.html as shown here.

2. Once the different pieces of software have been installed and configured, the AMI can be created as shown below.

3. Give the image name and description and click on `Create Image`.
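
The AMI can also be created from the AWS CLI as sketched below; the instance id, image name and description are placeholders.

# Create an AMI from the running instance (the id, name and description below are placeholders)
aws ec2 create-image \
    --instance-id i-0123456789abcdef0 \
    --name "webserver-ami" \
    --description "Amazon Linux with httpd and index.html"

# List the AMIs owned by this account to check the state (pending -> available)
aws ec2 describe-images --owners self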

4. It takes a couple of minutes to create the AMI. The status of the same can be seen by clicking on the AMIs link in the left side pane. Initially the AMI will be in the pending state; after a few minutes it will change to the available state.

5. On launching a new EC2 instance, we can select the new AMI which was created in the above steps from the `My AMIs` tab. The AMI has all the software and configurations in it, so there is no need to repeat the same thing again.

6. As we are simply learning/trying things and not running anything in production, make sure to a) terminate all the EC2 instances, b) deregister the AMI and c) delete the Snapshot.
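
The cleanup can also be done from the AWS CLI as sketched below, with placeholder ids.

# a) terminate the EC2 instances, b) deregister the AMI, c) delete the snapshot (the ids are placeholders)
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0
aws ec2 deregister-image --image-id ami-0123456789abcdef0
aws ec2 delete-snapshot --snapshot-id snap-0123456789abcdef0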

Now that we know how to create an AMI with all the required software/configurations/data, we will look at AutoScaling and ELB (Elastic Load Balancers) in the upcoming blogs.

Creating an alarm in CloudWatch for a Linux EC2 instance

AWS CloudWatch does a couple of things, as mentioned in the AWS documentation here. One of the interesting things is that it can gather metrics (CPU, Network, Disk etc.) for the different Amazon resources.

In this blog, we will look at how to view the CPU metrics of a Linux EC2 instance and trigger an Alarm with a corresponding action. To make it simple, if the CPU Utilization on the Linux instance goes beyond 50%, then we will be notified through an email. We can watch different metrics, but we will use CPU Utilization for now. CloudWatch also allows us to create custom metrics, like the application response time.

I will go with the assumption that the Linux instance has already been created as shown in this blog. So, here are the steps.

1. On the Linux instance, make sure that detailed monitoring is enabled. If not, then enable it from the EC2 management console as shown below.
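
Detailed monitoring can also be turned on from the AWS CLI; the instance id below is a placeholder.

# Enable detailed (1 minute) monitoring for the instance
aws ec2 monitor-instances --instance-ids i-0123456789abcdef0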

2. Go to the CloudWatch management console and select Metrics in the left pane. Select EC2 in the `All Metrics` tab.

3. Again select the `Per-Instance Metrics` from the `All Metrics` tab.

4. Go back to the EC2 management console and get the `Instance Id` as shown below.

5. Go back to the CloudWatch management console. Search for the InstanceId and select CPUUtilization. Now the line graph will be populated for CPUUtilization as shown below. We should also be able to select multiple metrics at the same time.

6. The graph is shown with a 5 minute granularity. By changing it to a 1 minute granularity, we get a much smoother graph. Click on the `Graphed Metrics` tab and change the Period to `1 minute`.

7. Now, let's try to create an Alarm. Click on Alarms in the left pane and then on the `Create Alarm` button.

8. Click on the `Per-Instance Metrics` under EC2. Search for the EC2 instance id and select CPUUtilization as shown below.

9. Now that the metric has been selected, we have to define the alarm. Click on `Next`. Specify the Name, Description and CPU % as shown below.

10. When the alarm threshold is breached (CPU > 50%), a notification has to be sent. In the same screen, go a bit down and specify the notification by clicking on `New list` as shown below.

11. Click on `Create Alarm` in the same screen. To avoid spamming, AWS will send an email with a confirmation link which has to be clicked. Go ahead and check your email for the confirmation.
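
For reference, the whole alarm and email notification setup can also be scripted with the AWS CLI, roughly as sketched below; the topic name, email address, instance id and the account/region in the ARN are placeholders.

# Create an SNS topic and subscribe an email address to it (a confirmation mail will be sent)
aws sns create-topic --name cpu-alarm-topic
aws sns subscribe \
    --topic-arn arn:aws:sns:us-east-1:123456789012:cpu-alarm-topic \
    --protocol email \
    --notification-endpoint you@example.com

# Create the alarm: average CPUUtilization > 50% over a 1 minute period
aws cloudwatch put-metric-alarm \
    --alarm-name high-cpu-utilization \
    --metric-name CPUUtilization \
    --namespace AWS/EC2 \
    --statistic Average \
    --period 60 \
    --evaluation-periods 1 \
    --threshold 50 \
    --comparison-operator GreaterThanThreshold \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:cpu-alarm-topic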

12. Once the link has been clicked in the email, we will get a confirmation as below.

13. Now we are all set up to receive an email notification when the CPU > 50%. Login to the Linux instance as mentioned in this blog and run the below command to increase the CPU usage on the Linux instance. The command doesn't do anything useful; it simply increases the CPU usage. There are more ways to increase the CPU usage, more here.
dd if=/dev/urandom | bzip2 -9 >> /dev/null

14. If we go back to the Metrics, we will notice that the CPU has spiked on the Linux instance because of the above command.

15. And the Alarm also moved from the OK status to the ALARM status as shown below.

16. Check your email and there would be a notification from AWS about the breach of the Alarm threshold.

17. Make sure to delete the Alarm and terminate the Linux instance.

The procedure is a bit long and it took me a good amount of time to write this blog, but once you do it a couple of times and understand what's being done, it will be a piece of cake.