Thursday, August 17, 2017

Creating a Thumbnail using AWS Lambda (Serverless Architecture)

Introduction to AWS Lambda

In one of the earlier blogs here, we discussed AWS Lambda, which is FaaS (Function as a Service), with a simple example. In Lambda the granularity is at the function level, and the pricing is based on the number of times a function is called, so the cost grows in direct proportion to the usage. We don't need to think in terms of servers (serverless architecture); AWS automatically scales the resources as the number of calls to the Lambda function increases. We only specify the amount of memory allocated to the Lambda function, and a proportional amount of CPU is allocated automatically. Here is the FAQ on Lambda.
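To make the model concrete, below is a minimal sketch of a Java Lambda function; the class name and the greeting logic are just placeholders, and it only assumes the RequestHandler interface from the aws-lambda-java-core library. The handleRequest method is the unit of work that Lambda invokes, bills and scales.

package example;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;

// A minimal Lambda function: AWS invokes handleRequest for every call and
// scales the underlying infrastructure automatically.
public class HelloLambda implements RequestHandler<String, String> {

    @Override
    public String handleRequest(String input, Context context) {
        context.getLogger().log("Input received: " + input);
        return "Hello, " + input;
    }
}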

Amazon has published a nice article here on how to create a Lambda function which gets triggered by an image uploaded to S3 and then automatically creates a Thumbnail of the same in a different S3 bucket. The same can be used for a photo sharing site like Flickr, Picasa etc. The article is detailed, but there are a lot of steps involving the CLI (Command Line Interface), which is not a piece of cake for those who are just starting with AWS. Here in this blog we will look at the sequence of steps for the same using the AWS Web Management Console.


Sequence of steps for creating an AWS Lambda function

Here we go with the assumption that Eclipse and the Java 8 SDK have already been set up on the system and that the participants already have an account created with AWS. For the sake of this article, the IAM, S3 and Lambda resources consumed fall under the AWS free tier.

Step 1: Start the Eclipse. Go to 'File -> New -> Project ...' and choose 'Maven Project' and click on Next.


Step 2: Choose to create a simple project and click on Next.

creating maven project in eclipse

Step 3: Type the following artifact information and click on Finish.

Group Id: doc-examples
Artifact Id: lambda-java-example
Version: 0.0.1-SNAPSHOT
Packaging: jar
Name: lambda-java-example

creating maven project in eclipse

The project will be created and the Package Explorer will be as below.

maven project structure in eclipse

Step 4: Replace the pom.xml content with the below content. This Maven file has all the dependencies and the plugins used in this project.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>doc-examples</groupId>
  <artifactId>lambda-java-example</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <name>lambda-java-example</name>
  <dependencies>
   <dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-lambda-java-core</artifactId>
    <version>1.1.0</version>
   </dependency>
   <dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-lambda-java-events</artifactId>
    <version>1.3.0</version>
   </dependency>
  </dependencies>
  <build>
   <plugins>
    <plugin>
     <groupId>org.apache.maven.plugins</groupId>
     <artifactId>maven-shade-plugin</artifactId>
     <version>2.3</version>
    </plugin>
   </plugins>
  </build>
</project> 

Step 5: Create the example package and then add the S3EventProcessorCreateThumbnail Java code from here in the S3EventProcessorCreateThumbnail.java file.

maven project structure in eclipse
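For reference, a hedged skeleton of the handler is shown below so that the later steps make sense. The full code linked above additionally uses the AWS SDK for Java (aws-java-sdk-s3) to download the source image and upload the generated thumbnail, which is only indicated here in comments, and the exact bucket-name suffix should match whatever the downloaded code uses.

package example;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;

// Skeleton of the thumbnail handler: Lambda passes in the S3 event describing
// the uploaded object, and the handler writes a resized copy into the target bucket.
public class S3EventProcessorCreateThumbnail implements RequestHandler<S3Event, String> {

    @Override
    public String handleRequest(S3Event s3event, Context context) {
        // Bucket and key of the image that triggered the function
        String srcBucket = s3event.getRecords().get(0).getS3().getBucket().getName();
        String srcKey = s3event.getRecords().get(0).getS3().getObject().getKey();

        // Target bucket name is derived by appending "resized" to the source bucket name
        String dstBucket = srcBucket + "resized";

        context.getLogger().log("Resizing " + srcBucket + "/" + srcKey + " into " + dstBucket);

        // The full example then:
        //   1. downloads the object with AmazonS3Client#getObject,
        //   2. resizes it using javax.imageio.ImageIO and java.awt.Graphics2D,
        //   3. uploads the thumbnail to dstBucket with AmazonS3Client#putObject.
        return "Ok";
    }
}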

Step 6: Now, it's time to build and package the project. Right click on the project in the Package Explorer view and then go to 'Run As -> Maven build ...'. Enter 'package' in the Goals field as shown below and then click Run.

packaging the code using maven

Once the Maven build is complete, BUILD SUCCESS should be shown in the Console view at the bottom, and the jar should appear in the target folder after refreshing the project.

jar file in eclipse

Step 7: Build the project again using the above step, this time with the Goal as 'package shade:shade'. The shade goal bundles the project classes and all the dependencies into a single standalone jar, which is what gets uploaded to Lambda. Make sure to change the name of the Maven run configuration to something other than the previous name, as shown below.

packaging the code using maven

Once this Maven build is complete, BUILD SUCCESS should again be shown in the Console view, and the shaded jar should appear in the target folder after refreshing the project.

jar file in eclipse

Step 8: An IAM role has to be created and attached to the Lambda function so that it can access the appropriate AWS resources. Go to the IAM AWS Management Console, click on Roles and then on 'Create new role'.

creating an iam role

Step 9: Select the AWS Lambda role type.

creating an iam role

Step 10: Filter for the AWSLambdaExecute policy and select the same.

creating an iam role

Step 11: Give the role a name and click 'Create role'.

creating an iam role

The role will be created as shown below. The same role will be attached to the Lambda function later.

creating an iam role

Step 12: Go to the Lambda AWS Management Console, click on 'Create a function' and then select 'Author from scratch'.

creating a lambda function

creating a lambda function

Step 13: A trigger from the S3 bucket can be added later; click on Next.

creating a lambda function

Step 14: Specify the below for the Lambda details and click on Next.
  • Function name as 'CreateThumbnail'
  • Runtime as 'Java 8'
  • Upload the lambda-java-example-0.0.1-SNAPSHOT.jar file from the target folder of the Eclipse project
  • Handler name as example.S3EventProcessorCreateThumbnail::handleRequest
  • Choose the role which has been created in the IAM

Step 15: Verify the details and click on 'Create function'.

creating a lambda function

Within a few seconds the success screen should be shown as below.

creating a lambda function

Clicking on the Functions link on the left will show the list of all the functions uploaded to Lambda from this account.

creating a lambda function

Now that the Lambda function has been created, it's time to create buckets in S3 and link the source bucket to the Lambda function.

Step 16: Go to the S3 AWS Management Console and create the source and target buckets. The name of the target bucket should be the source bucket name appended with the word resized, as this logic has been hard coded in the Java code. There is no need to create the airline-dataset bucket shown below; it was already there in S3.

creating buckets in s3

Step 17: Click on the S3 source bucket and then properties to associate it with the Lambda function created earlier as shown below.



s3 attaching the lambda function

s3 attaching the lambda function

Step 18: Upload the image to the source bucket.

image in the source bucket
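As a side note, the upload can also be scripted. Below is a hedged sketch using the AWS SDK for Java (it assumes an extra aws-java-sdk-s3 dependency, and the bucket, key and file names are placeholders); putting an object into the source bucket this way triggers the Lambda function exactly like the console upload does.

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class UploadImage {
    public static void main(String[] args) {
        // Uses the default credential provider chain and region configuration
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Putting an object into the source bucket fires the S3 event,
        // which in turn invokes the CreateThumbnail Lambda function
        s3.putObject("my-thumbnail-source-bucket", "sample-photo.jpg",
                new File("/path/to/sample-photo.jpg"));
    }
}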

If everything goes well, the Thumbnail image should appear in the target bucket within a few seconds. Note that the size of the resized Thumbnail image is smaller than that of the original image.

image in the target bucket

Step 19: The log files for the Java Lambda function can be found in the CloudWatch AWS Management Console as shown below. If for some reason the resized image does not appear in the target S3 bucket, the cause can be found in the CloudWatch logs.

cloudwatch logs

cloudwatch logs

cloudwatch logs

Step 20: A few metrics, like the invocation count and duration, can also be seen in the AWS Lambda Management Console.

cloudwatch metrics

 

Conclusion

A few of these steps can be automated using the AWS Toolkit for Eclipse and the serverless.com framework. But they hide most of the details, so it's better to follow the above sequence of steps at least once to know what happens behind the scenes in the AWS Lambda service. In future blogs, we will also explore how to do the same with the AWS Toolkit for Eclipse and the serverless.com framework.

The AWS Lambda function can be fronted by AWS API Gateway, which provides a REST-based interface to create, publish, maintain, monitor, and secure APIs at any scale. This is again a topic for a future blog post.

Serverless architecture may well be the future, as it moves the burden of managing the infrastructure from the developer to the cloud vendor like AWS. DynamoDB also falls under the same category: while creating a DynamoDB table, we simply specify the Read and Write Capacity Units and Amazon automatically provisions the appropriate resources, as in the case of Lambda.

This is one of the lengthiest blogs I have written and I really enjoyed it. I plan to write more such detailed blogs in the future.

Monday, July 3, 2017

Accessing the EMR Web Consoles

In the previous blog, we looked at how to start an AWS EMR cluster and run a Hive script. Once the cluster has been started, it provides web consoles to check the status of the cluster and to see the progress of the different data processing tasks. By default, the web consoles are blocked for the sake of security.

Below are the URLs of some of the web consoles.
YARN ResourceManager: http://master-public-dns-name:8088/
Hadoop HDFS NameNode: http://master-public-dns-name:50070/
Spark HistoryServer: http://master-public-dns-name:18080/
Zeppelin: http://master-public-dns-name:8890/
Hue: http://master-public-dns-name:8888/
Ganglia: http://master-public-dns-name/ganglia/
HBase UI: http://master-public-dns-name:16010/

YARN NodeManager: http://slave-public-dns-name:8042/
Hadoop HDFS DataNode: http://slave-public-dns-name:50075/

In this blog, we will explore how to access these web consoles. The AWS documentation for the same is here.

Step 1 : Start the EMR cluster as shown in the previous blog.

Step 2 : Set up an SSH tunnel to the master node using local port forwarding with the below command. Here the local port 8157 is forwarded to the remote port 8088, on which the YARN console is available. 8157 can be replaced by any free local port, and 8088 can be replaced by the port of whichever web console we want to access.
ssh -i /home/praveen/Documents/AWS-Keys/MyKeyPair.pem -N -L 8157:ec2-54-147-238-2.compute-1.amazonaws.com:8088 hadoop@ec2-54-147-238-2.compute-1.amazonaws.com

In the above command replace the following

a) the path of the key pair
b) the DNS name of the master node (twice)


Step 3 : Access the YARN console by pointing a browser on the same machine where Step 2 was performed to http://localhost:8157/; the request gets forwarded through the tunnel to port 8088 on the master node.


An alternative to the above steps is to modify the master node's Security Group to allow inbound traffic on port 8088, which is the YARN web console port number.

Tuesday, June 27, 2017

Processing the Airline dataset with AWS Athena

AWS Athena is an interactive query engine to process data in S3. Athena is based on Presto, which was developed by Facebook and then open sourced. With Athena there is no need to start a cluster or spawn EC2 instances. Simply create a table, point it to the data in S3 and run the queries.
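Besides the Query Editor in the console, queries can also be submitted programmatically. Below is a hedged sketch using the AWS SDK for Java (it assumes the aws-java-sdk-athena dependency; the database name and the results bucket are placeholders, and the query is the same one run later in Step 2), just to illustrate that there is no cluster to manage; a query is simply handed to the service and the results land in S3.

import com.amazonaws.services.athena.AmazonAthena;
import com.amazonaws.services.athena.AmazonAthenaClientBuilder;
import com.amazonaws.services.athena.model.QueryExecutionContext;
import com.amazonaws.services.athena.model.ResultConfiguration;
import com.amazonaws.services.athena.model.StartQueryExecutionRequest;
import com.amazonaws.services.athena.model.StartQueryExecutionResult;

public class AthenaQueryExample {
    public static void main(String[] args) {
        // Uses the default credential provider chain and region configuration
        AmazonAthena athena = AmazonAthenaClientBuilder.defaultClient();

        // Submit the query; Athena scans the data directly in S3,
        // there is no cluster to start or size.
        StartQueryExecutionRequest request = new StartQueryExecutionRequest()
                .withQueryString("SELECT origin, count(*) FROM ontime_parquet_snappy "
                        + "WHERE deptime > crsdeptime GROUP BY origin")
                .withQueryExecutionContext(new QueryExecutionContext().withDatabase("default"))
                .withResultConfiguration(new ResultConfiguration()
                        .withOutputLocation("s3://my-athena-results-bucket/output/"));

        StartQueryExecutionResult result = athena.startQueryExecution(request);
        System.out.println("Query execution id: " + result.getQueryExecutionId());
    }
}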

In the previous blog, we looked at converting the Airline dataset from the original csv format to a columnar format and then running SQL queries on the two datasets using the Hive/EMR combination. In this blog we will process the same datasets using Athena. So, here are the steps.

Step 1 : Go to the Athena Query Editor and create the ontime and the ontime_parquet_snappy tables as shown below. The DDL queries for creating these two tables can be found in this blog.



Step 2 : Run the query on the ontime and the ontime_parquet_snappy tables as shown below. Again, the queries can be found in the blog mentioned in Step 1.



Note that processing the csv data took 3.56 seconds and scanned 2.14 GB of S3 data, while processing the Parquet Snappy data took 3.07 seconds and scanned only 46.21 MB of S3 data.

There is not a significant time difference when running the queries on the two datasets. But Athena pricing is based on the amount of data scanned in S3, so the cost of processing the Parquet Snappy data is significantly less than that of the csv data.
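To put rough numbers on it, assuming Athena's pricing of $5 per TB of data scanned at the time of writing: the 2.14 GB scan costs roughly $0.01 per query, while the 46.21 MB scan costs roughly $0.0002, which is nearly 50 times cheaper for essentially the same query time.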

Step 3 : Go to the Catalog Manager and drop the tables. Dropping them will simply delete the table definitions, but not the associated data in S3.


Just out of curiosity, I created the two tables again and ran a different query this time. Below are the queries with the metrics.
select distinct(origin) from ontime_parquet_snappy;
Run time: 2.33 seconds, Data scanned: 4.76MB

select distinct(origin) from ontime;
Run time: 1.93 seconds, Data scanned: 2.14GB

As before, there is not much difference in the time taken for the query execution, but the amount of data scanned in S3 for the Parquet Snappy data is significantly lower. So, the cost to run the query on the Parquet Snappy data is significantly less.

Friday, June 23, 2017

Algorithmia - a store for algorithms, models and functions

I came across Algorithmia a few months back and didn't get a chance to try it out. It came into focus again with a Series A funding of $10.5M. More about the funding here.

Algorithmia is a place where algorithms, models or functions can be discovered and used for credits, which we can buy. We get 5,000 credits every month for free. For example, if a model costs 20 credits per call, then it can be called 250 times a month for free.

Create a free account here and get the API key from the profile. Now we should be able to call the different models using languages like Python, Java and R, and commands like curl. Below are the curl commands to do sentiment analysis on a sentence. Make sure to replace API_KEY with your own key.

curl -X POST -d '{"sentence": "I really like this website called algorithmia"}' -H 'Content-Type: application/json' -H 'Authorization: Simple API_KEY' https://api.algorithmia.com/v1/algo/nlp/SocialSentimentAnalysis

{"result":[{"compound":0.4201,"negative":0,"neutral":0.642,"positive":0.358,"sentence":"I really like this website called algorithmia"}],"metadata":{"content_type":"json","duration":0.010212005}}

curl -X POST -d '{"sentence": "I really dont like this website called algorithmia"}' -H 'Content-Type: application/json' -H 'Authorization: Simple API_KEY' https://api.algorithmia.com/v1/algo/nlp/SocialSentimentAnalysis

{"result":[{"compound":-0.3374,"negative":0.285,"neutral":0.715,"positive":0,"sentence":"I really dont like this website called algorithmia"}],"metadata":{"content_type":"json","duration":0.009965723}}
Algorithmia is much like the Google Play Store or the Apple App Store, where individuals and companies can upload mobile applications and the rest of us can download them. It's an attempt to democratize Artificial Intelligence and Machine Learning.

Here is a service to convert black and white images to color.

Monday, June 19, 2017

Converting Airline dataset from the row format to columnar format using AWS EMR

To process Big Data, a huge number of machines is required. Instead of buying them, it's better to process the data in the cloud, as it provides lower CAPEX and OPEX. In this blog we will look at processing the airline dataset in AWS EMR (Elastic MapReduce). EMR provides Big Data as a service: we don't need to worry about installing, configuring, patching or securing the Big Data software. EMR takes care of all that; we just need to specify the size and the number of machines in the cluster, the location of the input/output data and finally the program to run. It's as easy as that.

The Airline dataset is in csv format, which is efficient for fetching the data row-wise based on some condition, but not really efficient when we want to do aggregations. So, we will convert the csv data into the Parquet format and then run the same query on both the csv and the Parquet data to observe the performance improvement.

Note that using AWS EMR will incur cost and doesn't fall under the AWS free tier, as we will be launching not t2.micro EC2 instances but somewhat bigger ones. I will try to keep the cost to a minimum as this is a demo. Also, I prepared the required scripts ahead of time and tested them on small datasets on the local machine instead of on AWS EMR, which saves on AWS expenses.

So, here are the steps

Step 1 : Download the Airline dataset from here and uncompress it. All the years can be downloaded and uncompressed, but to keep the cost to a minimum I downloaded only the 1987, 1989, 1991, 1993 and 2007 data and uploaded it to S3 as shown below.



Step 2 : Create a folder called scripts in S3 and upload the below three scripts to it.


The '1-create-tables-move-data.sql' script will create the ontime and the ontime_parquet_snappy tables, map the csv data to the ontime table and finally move the data from the ontime table to the ontime_parquet_snappy table, transforming it from the csv to the Parquet format along the way. Below is the SQL for the same.
create external table ontime (
  Year INT,
  Month INT,
  DayofMonth INT,
  DayOfWeek INT,
  DepTime  INT,
  CRSDepTime INT,
  ArrTime INT,
  CRSArrTime INT,
  UniqueCarrier STRING,
  FlightNum INT,
  TailNum STRING,
  ActualElapsedTime INT,
  CRSElapsedTime INT,
  AirTime INT,
  ArrDelay INT,
  DepDelay INT,
  Origin STRING,
  Dest STRING,
  Distance INT,
  TaxiIn INT,
  TaxiOut INT,
  Cancelled INT,
  CancellationCode STRING,
  Diverted STRING,
  CarrierDelay INT,
  WeatherDelay INT,
  NASDelay INT,
  SecurityDelay INT,
  LateAircraftDelay INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://airline-dataset/airline-csv/';

create external table ontime_parquet_snappy (
  Year INT,
  Month INT,
  DayofMonth INT,
  DayOfWeek INT,
  DepTime  INT,
  CRSDepTime INT,
  ArrTime INT,
  CRSArrTime INT,
  UniqueCarrier STRING,
  FlightNum INT,
  TailNum STRING,
  ActualElapsedTime INT,
  CRSElapsedTime INT,
  AirTime INT,
  ArrDelay INT,
  DepDelay INT,
  Origin STRING,
  Dest STRING,
  Distance INT,
  TaxiIn INT,
  TaxiOut INT,
  Cancelled INT,
  CancellationCode STRING,
  Diverted STRING,
  CarrierDelay INT,
  WeatherDelay INT,
  NASDelay INT,
  SecurityDelay INT,
  LateAircraftDelay INT
) STORED AS PARQUET LOCATION 's3://airline-dataset/airline-parquet-snappy/' TBLPROPERTIES ("parquet.compression"="SNAPPY");

INSERT OVERWRITE TABLE ontime_parquet_snappy SELECT * FROM ontime;
The '2-run-queries-csv.sql' script will run the query on the ontime table which maps to the csv data. Below is the query.
INSERT OVERWRITE DIRECTORY 's3://airline-dataset/csv-query-output' select Origin, count(*) from ontime where DepTime > CRSDepTime group by Origin;
The '3-run-queries-parquet.sql' script will run the query on the ontime_parquet_snappy table which maps to the Parquet-Snappy data. Below is the query.
INSERT OVERWRITE DIRECTORY 's3://airline-dataset/parquet-snappy-query-output' select Origin, count(*) from ontime_parquet_snappy where DepTime > CRSDepTime group by Origin;
Step 3 : Go to the EMR management console and click on 'Go to advanced options'.


Step 4 : Here select the software to be installed on the instances. For this blog we need Hadoop 2.7.3 and Hive 2.1.1; make sure these are selected, the rest are optional. Here we can also add a Step. According to the AWS documentation, a Step is 'a unit of work that contains instructions to manipulate data for processing by software installed on the cluster'. This can be an MR program, a Hive query, a Pig script or something else. The Steps can be added here or later; we will add them later. Click on Next.


Step 5 : In this step, we can select the number of instances we want to run and the size of each instance. We will leave them at the defaults and click on Next.


Step 6 : In this step, we can select additional settings like the cluster name, the S3 log path location and so on. Make sure the 'S3 folder' points to the log folder in S3, uncheck the 'Termination protection' option and click on Next.


Step 7 : In this screen, again, all the default options are good enough. If we want to SSH into the EC2 instances, then an 'EC2 key pair' has to be selected. Here are the instructions on how to create a key pair. Finally, click on 'Create cluster' to launch the cluster.


Initially the cluster will be in a Starting state and the EC2 instances will be launched as shown below.



Within a few minutes, the cluster will be in a running state and the Steps (data processing programs) can be added to the cluster.


Step 8 : Add a Step to the cluster by clicking on 'Add step', pointing it to the '1-create-tables-move-data.sql' file as shown below and clicking on Add. The processing will start on the cluster.



The Step will be in a Pending status for some time and then move to the Completed status after the processing has been done.



Once the processing is complete, the csv data will have been converted into the Parquet format with Snappy compression and put into S3 as shown below.


Note that the csv data was close to 2,192 MB while the Parquet Snappy data is around 190 MB, roughly an 11x reduction. Parquet stores the data in a columnar format and provides higher compression compared to the csv format. This enables more data to fit into memory and so gives quicker results.

Step 9 : Now add 2 more steps using the '2-run-queries-csv.sql' and the '3-run-queries-parquet.sql'. The first sql file will run the query on the csv data table and the second will run the query on the Parquet Snappy table. Both the queries are the same, returning the same results in S3.

Step 10 : Check the step log files for the steps to get the execution times in the EMR management console.

Converting the CSV to Parquet Snappy format - 148 seconds
Executing the query on the csv data - 90 seconds
Executing the query on the Parquet Snappy data - 56 seconds

Note that the query runs faster on the Parquet Snappy data than on the csv data. I was expecting the query to run even faster; I need to look into this a bit more.

Step 11 : Now that the processing has been done, it's time to terminate the cluster. Click on Terminate and again on Terminate. It will take a few minutes for the cluster to terminate.


Note that once the EMR cluster is terminated, the associated EC2 instances will also terminate.



Step 12 : Go back to the S3 management console; the below folders should be there. Clean up by deleting the bucket. I will be keeping the data so that I can try Athena and Redshift on the csv and the Parquet Snappy data. Note that 5 GB of S3 data can be stored for free for up to one year. More details about the AWS free tier are here.


In future blogs, we will look at processing the same data using AWS Athena. With Athena there is no need to spawn a cluster; it follows the serverless model, and AWS automatically provisions the underlying servers. We simply create a table, map it to the data in S3 and run the SQL queries on it.

With EMR the pricing is rounded up to the hour, so for a query that executes for about 1 hour and 5 minutes we need to pay for the complete 2 hours. With Athena we pay by the amount of data scanned. So, converting the data into a columnar format and compressing it will not only make the query run a bit faster, but also cut down the bill.

Update: Here and here are articles from the AWS documentation on the same. They include some additional commands.