Monday, July 3, 2017

Accessing the EMR Web Consoles

In the previous blog, we looked at how to start an AWS EMR cluster and run a Hive script. Once the cluster has been started, it provides web consoles to check the status of the cluster and to monitor the progress of the different data processing tasks. By default, the web consoles are blocked for security reasons.

Below are the URLs of some of the web consoles.
YARN ResourceManager  http://master-public-dns-name:8088/
Hadoop HDFS NameNode  http://master-public-dns-name:50070/
Spark HistoryServer  http://master-public-dns-name:18080/
Zeppelin   http://master-public-dns-name:8890/
Hue    http://master-public-dns-name:8888/
Ganglia   http://master-public-dns-name/ganglia/
HBase UI   http://master-public-dns-name:16010/

YARN NodeManager  http://slave-public-dns-name:8042/
Hadoop HDFS DataNode  http://slave-public-dns-name:50075/
In this blog, we will explore how to access the web consoles. The AWS documentation for the same is here.

Step 1 : Start the EMR cluster as shown in the previous blog.

Step 2 : Set up an SSH tunnel to the master node using local port forwarding with the command below. Here the local port 8157 is forwarded to the remote port 8088. Port 8157 can be replaced by any free local port, and 8088 is the port on which the YARN console is available; it can be replaced by the port of whichever web console we want to access.
ssh -i /home/praveen/Documents/AWS-Keys/MyKeyPair.pem -N -L 8157:master-public-dns-name:8088 hadoop@master-public-dns-name

In the above command, replace the following:

a) the path of the key pair
b) the DNS name of the master node (twice)

Step 3 : Access the YARN console by pointing a browser on the machine where the tunnel was set up to http://localhost:8157 (the forwarded local port).

An alternative to the above steps is to modify the master node's security group to allow inbound traffic on port 8088, the YARN web console port.

Tuesday, June 27, 2017

Processing the Airline dataset with AWS Athena

AWS Athena is an interactive query engine to process data in S3. Athena is based on Presto, which was developed by Facebook and then open sourced. With Athena there is no need to start a cluster or spawn EC2 instances. Simply create a table, point it to the data in S3 and run the queries.

In the previous blog, we looked at converting the Airline dataset from the original csv format to a columnar format and then running SQL queries on the two datasets using the Hive/EMR combination. In this blog we will process the same datasets using Athena. So, here are the steps.

Step 1 : Go to the Athena Query Editor and create the ontime and the ontime_parquet_snappy tables as shown below. The DDL queries for creating these two tables can be found in this blog.

Step 2 : Run the query on the ontime and the ontime_parquet_snappy tables as shown below. Again, the queries can be found in the blog mentioned in Step 1.

Note that processing the csv data took 3.56 seconds and scanned 2.14 GB of S3 data, while processing the Parquet Snappy data took 3.07 seconds and scanned only 46.21 MB.

There is not a significant time difference between running the queries on the two datasets. But Athena pricing is based on the amount of data scanned in S3, so it costs significantly less to process the Parquet Snappy data than the csv data.

Step 3 : Go to the Catalog Manager and drop the tables. Dropping them simply deletes the table definitions, not the associated data in S3.

Just out of curiosity, I created the two tables again and ran a different query this time. Below are the queries with their metrics.
select distinct(origin) from ontime_parquet_snappy;
Run time: 2.33 seconds, Data scanned: 4.76MB

select distinct(origin) from ontime;
Run time: 1.93 seconds, Data scanned: 2.14GB

As before, there is not much difference in the query execution time, but the amount of S3 data scanned for the Parquet Snappy data is significantly lower. So, the cost of running the query on the Parquet Snappy data is significantly less.
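To see why the scanned-data difference matters, here is a quick sketch of the cost of each query, assuming Athena's per-TB-scanned pricing (the $5/TB rate below is an assumption; check the AWS pricing page for the current figure):

```python
# Rough Athena cost estimate based on bytes scanned.
# Assumes the commonly quoted $5 per TB scanned; the actual rate
# (and any per-query minimum) should be confirmed on the pricing page.
PRICE_PER_TB = 5.00

def athena_cost(bytes_scanned):
    """Estimated cost in dollars for a query scanning `bytes_scanned` bytes."""
    return bytes_scanned / (1024 ** 4) * PRICE_PER_TB

csv_scanned = 2.14 * 1024 ** 3      # 2.14 GB scanned for the ontime (csv) table
parquet_scanned = 4.76 * 1024 ** 2  # 4.76 MB scanned for the Parquet Snappy table

print(f"csv query:     ${athena_cost(csv_scanned):.6f}")
print(f"parquet query: ${athena_cost(parquet_scanned):.6f}")
```

The same query costs orders of magnitude less on the columnar data, purely because far fewer bytes are scanned.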

Friday, June 23, 2017

Algorithmia - a store for algorithms, models and functions

I came across Algorithmia a few months back and didn't get a chance to try it out. It came into focus again with a Series A funding of $10.5M. More about the funding here.

Algorithmia is a place where algorithms, models and functions can be discovered and used for credits, which we can buy. We get 5,000 credits every month for free. For example, if a model costs 20 credits per call, it can be called 250 times a month.
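The credit arithmetic above can be checked quickly (the 20-credit cost is just the example figure from the paragraph):

```python
# Free monthly credit allowance divided by an example per-call cost.
FREE_CREDITS_PER_MONTH = 5000
COST_PER_CALL = 20  # hypothetical model costing 20 credits per call

calls_per_month = FREE_CREDITS_PER_MONTH // COST_PER_CALL
print(calls_per_month)  # 250 free calls a month for this model
```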

Create a free account here and get the API key from your profile. Now we can call the different models using languages like Python, Java and R, or commands like curl. Below are the curl commands to do sentiment analysis on a sentence. Make sure to replace API_KEY with your own key.

curl -X POST -d '{"sentence": "I really like this website called algorithmia"}' -H 'Content-Type: application/json' -H 'Authorization: Simple API_KEY'

{"result":[{"compound":0.4201,"negative":0,"neutral":0.642,"positive":0.358,"sentence":"I really like this website called algorithmia"}],"metadata":{"content_type":"json","duration":0.010212005}}

curl -X POST -d '{"sentence": "I really dont like this website called algorithmia"}' -H 'Content-Type: application/json' -H 'Authorization: Simple API_KEY'

{"result":[{"compound":-0.3374,"negative":0.285,"neutral":0.715,"positive":0,"sentence":"I really dont like this website called algorithmia"}],"metadata":{"content_type":"json","duration":0.009965723}}
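The JSON responses above are easy to consume programmatically. A minimal sketch in Python, using the first response shown above verbatim:

```python
import json

# Raw response exactly as returned by the first curl call above.
raw = ('{"result":[{"compound":0.4201,"negative":0,"neutral":0.642,'
       '"positive":0.358,"sentence":"I really like this website called algorithmia"}],'
       '"metadata":{"content_type":"json","duration":0.010212005}}')

response = json.loads(raw)
scores = response["result"][0]

# The compound score summarises the sentiment: above 0 is positive, below is negative.
label = "positive" if scores["compound"] > 0 else "negative"
print(scores["compound"], label)  # 0.4201 positive
```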
Algorithmia is much like the Google Play Store or the Apple App Store, where individuals and companies upload mobile applications and the rest of us download them. It's an attempt to democratize Artificial Intelligence and Machine Learning.

Here is a service to convert black and white images to color.

Monday, June 19, 2017

Converting Airline dataset from the row format to columnar format using AWS EMR

Processing Big Data requires a huge number of machines. Instead of buying them, it's better to process the data in the Cloud, as it provides lower CAPEX and OPEX costs. In this blog we will look at processing the Airline dataset in AWS EMR (Elastic MapReduce). EMR provides Big Data as a service. We don't need to worry about installing, configuring, patching or securing the Big Data software; EMR takes care of all that. We just need to specify the size and number of machines in the cluster, the location of the input/output data and finally the program to run. It's as easy as that.

The Airline dataset is in csv format, which is efficient for fetching data row-wise based on some condition, but not really efficient for aggregations. So, we will convert the csv data into Parquet format and then run the same queries on the csv and the Parquet data to observe the performance improvement.

Note that using AWS EMR will incur costs and doesn't fall under the AWS free tier, as we will be launching not t2.micro instances but somewhat bigger EC2 instances. I will try to keep the cost to a minimum, as this is a demo. Also, I prepared the required scripts ahead of time and tested them on the local machine on small datasets instead of on AWS EMR. This saves on AWS expenses.

So, here are the steps

Step 1 : Download the Airline dataset from here and uncompress it. All the datasets can be downloaded and uncompressed, but to keep the cost to a minimum I downloaded only the 1987, 1989, 1991, 1993 and 2007 data and uploaded it to S3 as shown below.

Step 2 : Create a folder called scripts, put the three scripts described below in it, and upload it to S3.

The '1-create-tables-move-data.sql' script creates the ontime and the ontime_parquet_snappy tables, maps the data to the tables and finally moves the data from the ontime table to the ontime_parquet_snappy table, transforming it from csv to Parquet format along the way. Below is the SQL for the same.
create external table ontime (
  Year INT,
  Month INT,
  DayofMonth INT,
  DayOfWeek INT,
  DepTime  INT,
  CRSDepTime INT,
  ArrTime INT,
  CRSArrTime INT,
  UniqueCarrier STRING,
  FlightNum INT,
  TailNum STRING,
  ActualElapsedTime INT,
  CRSElapsedTime INT,
  AirTime INT,
  ArrDelay INT,
  DepDelay INT,
  Origin STRING,
  Dest STRING,
  Distance INT,
  TaxiIn INT,
  TaxiOut INT,
  Cancelled INT,
  CancellationCode STRING,
  Diverted STRING,
  CarrierDelay INT,
  WeatherDelay INT,
  NASDelay INT,
  SecurityDelay INT,
  LateAircraftDelay INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://airline-dataset/airline-csv/';

create external table ontime_parquet_snappy (
  Year INT,
  Month INT,
  DayofMonth INT,
  DayOfWeek INT,
  DepTime  INT,
  CRSDepTime INT,
  ArrTime INT,
  CRSArrTime INT,
  UniqueCarrier STRING,
  FlightNum INT,
  TailNum STRING,
  ActualElapsedTime INT,
  CRSElapsedTime INT,
  AirTime INT,
  ArrDelay INT,
  DepDelay INT,
  Origin STRING,
  Dest STRING,
  Distance INT,
  TaxiIn INT,
  TaxiOut INT,
  Cancelled INT,
  CancellationCode STRING,
  Diverted STRING,
  CarrierDelay INT,
  WeatherDelay INT,
  NASDelay INT,
  SecurityDelay INT,
  LateAircraftDelay INT
) STORED AS PARQUET LOCATION 's3://airline-dataset/airline-parquet-snappy/' TBLPROPERTIES ("parquet.compression"="SNAPPY");

INSERT OVERWRITE TABLE ontime_parquet_snappy SELECT * FROM ontime;
The '2-run-queries-csv.sql' script runs the query on the ontime table, which maps to the csv data. Below is the query.
INSERT OVERWRITE DIRECTORY 's3://airline-dataset/csv-query-output' select Origin, count(*) from ontime where DepTime > CRSDepTime group by Origin;
The '3-run-queries-parquet.sql' script runs the query on the ontime_parquet_snappy table, which maps to the Parquet Snappy data. Below is the query.
INSERT OVERWRITE DIRECTORY 's3://airline-dataset/parquet-snappy-query-output' select Origin, count(*) from ontime_parquet_snappy where DepTime > CRSDepTime group by Origin;
Step 3 : Go to the EMR management console and click on 'Go to advanced options'.

Step 4 : Here, select the software to be installed on the instances. For this blog we need Hadoop 2.7.3 and Hive 2.1.1; make sure these are selected, the rest are optional. Here we can also add a Step. According to the AWS documentation, a Step is 'a unit of work that contains instructions to manipulate data for processing by software installed on the cluster'. This can be a MapReduce program, a Hive query, a Pig script or something else. Steps can be added here or later; we will add them later. Click on Next.

Step 5 : In this step, we can select the number of instances we want to run and the size of each instance. We will leave them as the defaults and click on Next.

Step 6 : In this step, we can select additional settings like the cluster name, the S3 log path and so on. Make sure the 'S3 folder' points to the log folder in S3, uncheck the 'Termination protection' option and click on Next.

Step 7 : In this screen, again, the default options are good enough. If we want to ssh into the EC2 instances, an 'EC2 key pair' has to be selected. Here are the instructions on how to create a key pair. Finally, click on 'Create cluster' to launch the cluster.

Initially the cluster will be in a Starting state and the EC2 instances will be launched as shown below.

Within a few minutes, the cluster will be in a running state and the Steps (data processing programs) can be added to the cluster.

Step 8 : Add a Step to the cluster by clicking on 'Add step', pointing it to the '1-create-tables-move-data.sql' file as shown below and clicking on Add. The processing will start on the cluster.

The Step will be in a Pending status for some time and then move to the Completed status after the processing has been done.

Once the processing is complete, the csv data will have been converted into Parquet format with Snappy compression and put into S3 as shown below.

Note that the csv data was close to 2,192 MB, while the Parquet Snappy data is around 190 MB. Parquet is a columnar format and provides higher compression than csv. This makes it possible to fit more data into memory, which leads to quicker results.
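The size figures above work out to better than an 11x reduction:

```python
# Compression factor from the sizes reported above.
csv_mb = 2192      # size of the csv data in MB
parquet_mb = 190   # size of the Parquet Snappy data in MB

ratio = csv_mb / parquet_mb
print(f"Parquet Snappy is about {ratio:.1f}x smaller than csv")  # about 11.5x
```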

Step 9 : Now add 2 more Steps using the '2-run-queries-csv.sql' and '3-run-queries-parquet.sql' scripts. The first runs the query on the csv data table and the second runs the same query on the Parquet Snappy table; both return the same results in S3.

Step 10 : Check the step log files for the steps to get the execution times in the EMR management console.

Converting the CSV to Parquet Snappy format - 148 seconds
Executing the query on the csv data - 90 seconds
Executing the query on the Parquet Snappy data - 56 seconds

Note that the query runs faster on the Parquet Snappy data than on the csv data. I was expecting it to run a bit faster still; I need to look into this a bit more.
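From the step logs above, the relative speedup on the Parquet Snappy data works out as follows:

```python
# Query execution times from the EMR step logs, in seconds.
csv_seconds = 90       # query on the csv table
parquet_seconds = 56   # same query on the Parquet Snappy table

speedup = csv_seconds / parquet_seconds
print(f"About {speedup:.2f}x faster on Parquet Snappy")  # about 1.61x
```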

Step 11 : Now that the processing has been done, it's time to terminate the cluster. Click on Terminate and again on Terminate. It will take a few minutes for the cluster to terminate.

Note that the EMR cluster will be terminated and the EC2 instances will also be terminated.

Step 12 : Go back to the S3 management console; the below folders should be there. Clean up by deleting the bucket. I will be keeping the data, though, so that I can try Athena and Redshift on the csv and the Parquet Snappy data. Note that 5 GB of S3 data can be stored for free for up to one year. More details about the AWS free tier here.

In future blogs, we will look at processing the same data using AWS Athena. With Athena there is no need to spawn a cluster; it follows the serverless model, with AWS provisioning the servers automatically. We simply create a table, map it to the data in S3 and run SQL queries on it.

With EMR, pricing is rounded up to the hour, so for a job that runs about 1 hour and 5 minutes we pay for a full 2 hours. With Athena we pay by the amount of data scanned. So, converting the data into a columnar format and compressing it not only makes the query run a bit faster, but also cuts down the bill.
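The two billing models in the paragraph above can be sketched as follows (the $5-per-TB Athena rate is an assumption; check the AWS pricing page for the current figure):

```python
import math

# EMR (2017 model): billed per instance-hour, rounded up.
runtime_minutes = 65  # about 1 hour and 5 minutes
billed_hours = math.ceil(runtime_minutes / 60)
print(billed_hours)  # 2 full hours billed for a 65-minute job

# Athena: billed per TB scanned, regardless of wall-clock time.
price_per_tb = 5.00   # assumed rate
gb_scanned = 2.14     # the csv query from the earlier Athena post
print(f"${gb_scanned / 1024 * price_per_tb:.4f}")
```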

Update : Here and here are articles from the AWS documentation on the same topic. They include some additional commands.

Friday, June 16, 2017

GCP for AWS Professionals

As many of you know, I have been working with AWS (the public Cloud from Amazon) for quite some time, so I thought of getting my feet wet with GCP (the public Cloud from Google). I tried to find some free MOOCs around GCP and didn't find many.

Google partnered with Coursera and started a MOOC on the GCP fundamentals. It covers the different GCP fundamentals at a very high level with a few demos. Here is the link for the same. There is also documentation from Google comparing the GCP and the AWS platform here.

As I was going through the above-mentioned MOOC, I could clearly map many of the GCP services to the AWS services I am more comfortable with. Here is the mapping between the different GCP and AWS services. The mappings are really helpful for those who are comfortable with one Cloud platform and want to get familiar with the other.

AWS provides free resources for developers to get more familiar with and get started on their platform. The same is the case with GCP: a $300 credit, valid for 1 year, is provided. Both Cloud vendors provide a few services free for the first year, and some services are free for life with some usage limitations.

Maybe it's my perspective, but the content around AWS seems to be much more robust and organized compared to the GCP documentation. The same is the case with the Web UI. Anyway, here is the documentation for AWS and here is the documentation for GCP.

Happy clouding.

AWS Lambda with Serverless framework

Google Cloud Functions, IBM OpenWhisk, Azure Functions and AWS Lambda allow building applications in a Serverless fashion. Serverless doesn't mean that there is no server; rather, the Cloud vendor takes care of provisioning the servers and scaling them up and down as required. In the Serverless model, all we have to do is author a function, along with the event that triggers it and the resources it uses.

Since no servers are being provisioned, there is no constant cost. The cost is directly proportional to the number of times the function is called and the amount of resources it consumes. This model is useful for a startup with limited resources or for a big company that wants to deploy applications quickly.

Here is a nice video from the founder of aCloudGuru on how they built their entire business on a Serverless Architecture. Here is a nice article/tutorial from AWS on using a Lambda function to shrink images uploaded to an S3 bucket.

In the above workflow, as soon as an image is uploaded to the S3 Source Bucket, it fires an event that calls the Lambda function. The function shrinks the image and then puts it in the S3 Target Bucket. In this scenario, there is no sustained cost, as we pay based on the number of times the function is called and the amount of resources it consumes.

For the last few days I have been intrigued by the Serverless architecture and have been trying the use cases mentioned in the AWS Lambda documentation here. It was fun, but not straightforward or as simple as uploading a function. The function has to be authored, packaged and uploaded; permissions have to be set for the events/resources; the Lambda has to be tested and finally integrated with the API Gateway. Oops. It's not an impossible task, but definitely tedious. And I haven't even mentioned debugging Lambda functions, which is a real pain.

To the rescue comes the Serverless framework, which makes working with Lambda functions easy. Setting up the Serverless framework on Ubuntu was a breeze. Creating a HelloWorld Lambda function in AWS with all the required dependencies was even easier.

Note that the Serverless framework supports other platforms besides AWS, but in this blog I will provide some screenshots with a brief write-up on AWS. Here, I go with the assumption that the Serverless framework and all its dependencies, like NodeJS and integration with the AWS Security Credentials, have been set up.

So, here are the steps.

Step 1 : Create the skeleton code and the Serverless Lambda configuration file.

Step 2: Deploy the code to AWS using the CloudFormation.

Once the deployment is complete, the AWS resources will be created as shown below in the CloudFormation console.

As seen in the below screen, the Lambda function also gets deployed.

Step 3 : Now it is time to invoke the Lambda function, again using the Serverless framework, as shown below.

In the Lambda management console, we can observe that the Lambda function was called once.

Step 4 : Undeploy the Lambda function using Serverless framework.

The CloudFormation stack should be removed and all the associated resources should be deleted.

This completes the steps for a simple demo of the Serverless framework. One thing to note: we simply uploaded a Python function to the AWS Cloud and never created a server, hence the name Serverless Architecture. Here are a few more resources to get started with the Serverless Architecture.

On a side note, I was really surprised that the Serverless framework was started by CocaCola, which at some point decided to open source it on Github. It was always companies like Facebook, Google, Twitter, LinkedIn, Netflix and Microsoft :) opening their internal software to the public; this was the first time I saw something like this from CocaCola.

Maybe I should drink more CocaCola and promote them to publish such cool frameworks.

Wednesday, May 31, 2017

New AWS Certifications on Big Data and Advanced Networking

AWS announced the AWS Certified Big Data – Specialty and the AWS Certified Advanced Networking – Specialty certifications. These certifications had been in Beta and were opened to the public today. The exam guide and the sample questions can be downloaded from the certification home pages mentioned earlier. One of the associate-level certifications (1, 2, 3) has to be cleared before taking these certifications.

There are two online courses to help with getting through the AWS Big Data certification (here and here). I am not exactly sure how much these courses cover the various topics of the Big Data certification, but it never hurts to take one of them.

I am comfortable with Big Data and the AWS Cloud, so I am planning to attempt the `AWS Certified Big Data – Specialty` ASAP. I will also post an update with tips and my experience with the certification. I am really excited about it.

Best of luck !!!

Friday, May 19, 2017

AWS and Big Data Training

I have been in the training line (along with consulting and projects) for a few years, both online and in the classroom, using a Wacom Tablet which I bought 5 years back. The Tablet helps me convey the message better and more quickly, whether the participant is in some other part of the world or in front of me. I am really passionate about the trainings, as they help me think about a particular concept from different perspectives.

So, here is an AWS demo on how to create an ELB (Elastic Load Balancer) and EC2 (Linux server) instances in the Cloud. This particular combination can be used for High Availability or for Scaling Up as more and more users start using your application.

Similarly, below is a Big Data demo on creating an HBase Observer, which is very similar to a Trigger in an RDBMS. BTW, for those who are new to HBase, it's a columnar NoSQL database. There are a lot of NoSQL databases, and HBase is one of them.

If you are interested in any of these trainings, please contact me for further details. Both Big Data and Cloud are all the rage, so don't miss the opportunities around them.

See you soon !!!

Wednesday, May 3, 2017

Attaching an EBS Disk to a Linux Instance

In the previous blog, we looked at the sequence of steps to create a Linux instance and log into it. In this blog, we will create a new hard disk (actually an EBS volume) and attach it to the Linux instance. I am going with the assumption that the EC2 instance has already been created.

1. Go to the EC2 management console and click on Volumes in the left pane. Then click on `Create Volume`, change the volume size to 1 GB and click on Create.

2. It takes a couple of seconds, but then there will be two EBS volumes: an 8 GB volume (in-use) which was automatically created at the time of EC2 creation, and a 1 GB volume (available) which was created in the above step.

3. Select the 1 GB volume, click on Actions and then Attach Volume. We have created a Linux EC2 instance and an EBS volume; now we need to attach the two.

4. Move the cursor to the instance box, select the EC2 instance to which the EBS volume has to be attached and then click Attach. The state of the 1 GB volume should change from available to in-use.

5. Get the IP address of the EC2 instance and log into it using Putty as mentioned here.

6. Change to root (sudo su) and get the list of partition tables (fdisk -l). The commands are mentioned in the parentheses. Note that the device name is /dev/xvdf in the output below. It may vary, so the previously mentioned command has to be run.

7. Build the Linux filesystem (mkfs /dev/xvdf), create a folder (mkdir /mnt/disk100) and finally mount the filesystem (mount /dev/xvdf /mnt/disk100). The commands are mentioned in the parentheses. Note that I have chosen to create the disk100 folder; you can replace it with any other folder name.

Now that the device has been mounted on the /mnt/disk100 folder, data written to this folder will go to the 1 GB EBS volume we created in one of the previous steps. Even after stopping the EC2 instance, the data will still be there on the EBS volume. The volume can also be attached to another EC2 instance.

Note that in AWS, an EBS volume cannot be attached to multiple EC2 instances at the same time. But the same thing can be done on the Google Cloud Platform.

Don't forget to terminate the Linux EC2 instance and delete the EBS volume. In the next blog, we will attach an EBS volume to a Windows EC2 instance, which is a bit easier.

Wednesday, April 26, 2017

Scratch programming for kids

It's summer holidays and I have been helping my kid (8 years old) get started with programming. So I bought the Computer Coding for Kids book by Carol Vorderman. The book covers Scratch (from MIT) first and then Python, starting from the basics. Scratch is a visual programming language which is easy for those who are just getting started with programming: no need to type code, it's more drag and drop.

So, here is his first program (link here), a small and cute game in Scratch. Click on the flag at the top right to get the game started. He was all excited to get it published on this blog. One thing to note is that running the game requires Adobe Flash, which is slowly going out of fashion with the different browsers because of its vulnerabilities and stability issues. For some reason, I was able to get it to run only in IE and not in the Edge/Chrome/Firefox browsers.

Scratch looks a bit basic, but I have seen some interesting programs developed in it. I would recommend the above-mentioned book for anyone who wants to get started with programming. Carol has written a few more books, which I plan to buy once we complete this one.

Monday, April 24, 2017

Creating a Linux AMI from an EC2 instance

In one of the earlier blogs, we created a Linux EC2 instance with a static page. We installed and started the httpd server and created a simple index.html in the /var/www/html folder. The application on the EC2 instance is very basic, but we can definitely build complex applications the same way.

Let's say we need 10 such instances. It's not necessary to install the software and make the configuration changes multiple times. What we can do is create an AMI once the software has been installed and the necessary configuration changes have been made. The AMI will have the OS and the required software with the appropriate configuration. The same AMI can then be used while launching new EC2 instances. More about AMIs here.

So, here are the steps

1. Create a Linux EC2 instance with the httpd server and index.html as shown here.

2. Once the different software packages have been installed and configured, the AMI can be created as shown below.

3. Give the image name and description and click on `Create Image`.

4. It takes a couple of minutes to create the AMI. Its status can be seen by clicking on the AMIs link in the left pane. Initially the AMI will be in a pending state; after a few minutes it will change to the available state.

5. When launching a new EC2 instance, we can select the new AMI created in the above steps from the `My AMIs` tab. The AMI has all the software and configuration in it, so there is no need to repeat the same thing again.

6. As we are simply learning/trying things and not running anything in production, make sure to a) terminate all the EC2 instances, b) deregister the AMI and c) delete the snapshot.

Now that we know how to create an AMI with all the required software/configuration/data, we will look at Auto Scaling and ELB (Elastic Load Balancers) in the upcoming blogs.