Monday, July 3, 2017

Accessing the EMR Web Consoles

In the previous blog, we looked at how to start an AWS EMR cluster and run a Hive script. Once the cluster has been started, it provides web consoles to check the status of the cluster and to monitor the progress of the different data processing tasks. By default, the web consoles are blocked for security reasons.

Below are the URLs of some of the web consoles.
YARN ResourceManager  http://master-public-dns-name:8088/
Hadoop HDFS NameNode  http://master-public-dns-name:50070/
Spark HistoryServer  http://master-public-dns-name:18080/
Zeppelin   http://master-public-dns-name:8890/
Hue    http://master-public-dns-name:8888/
Ganglia   http://master-public-dns-name/ganglia/
HBase UI   http://master-public-dns-name:16010/

YARN NodeManager  http://slave-public-dns-name:8042/
Hadoop HDFS DataNode  http://slave-public-dns-name:50075/
In this blog, we will explore how to access the web consoles. The AWS documentation for the same is here.

Step 1 : Start the EMR cluster as shown in the previous blog.

Step 2 : Set up an SSH tunnel to the master node using local port forwarding with the command below. Here the local port 8157 is forwarded to the remote port 8088. Port 8157 can be replaced by any free local port, and 8088 is the port on which the YARN console is available; it can be replaced by the port of whichever web console we want to access.
ssh -i /home/praveen/Documents/AWS-Keys/MyKeyPair.pem -N -L 8157:master-public-dns-name:8088 hadoop@master-public-dns-name

In the above command, replace the following:

a) the path of the key pair
b) the DNS name of the master node (twice)

Step 3 : Access the YARN console by pointing a browser on the machine where the tunnel was set up to http://localhost:8157 (the forwarded local port).

An alternative to the above steps is to modify the master node's security group to allow inbound traffic on port 8088, the YARN web console port.

Tuesday, June 27, 2017

Processing the Airline dataset with AWS Athena

AWS Athena is an interactive query engine to process data in S3. Athena is based on Presto, which was developed by Facebook and then open sourced. With Athena there is no need to start a cluster or spawn EC2 instances. Simply create a table, point it to the data in S3 and run the queries.

In the previous blog, we looked at converting the Airline dataset from the original csv format to a columnar format and then running SQL queries on the two datasets using the Hive/EMR combination. In this blog we will process the same datasets using Athena. So, here are the steps.

Step 1 : Go to the Athena Query Editor and create the ontime and the ontime_parquet_snappy tables as shown below. The DDL queries for creating these two tables can be found in this blog.

Step 2 : Run the query on the ontime and the ontime_parquet_snappy tables as shown below. Again, the queries can be found in the blog mentioned in Step 1.

Note that processing the csv data took 3.56 seconds and scanned 2.14 GB of S3 data, while processing the Parquet Snappy data took 3.07 seconds and scanned only 46.21 MB.

There is not a significant time difference between running the queries on the two datasets. But Athena pricing is based on the amount of data scanned in S3, so it costs significantly less to process the Parquet Snappy data than the csv data.

Step 3 : Go to the Catalog Manager and drop the tables. Dropping them simply deletes the table definitions, not the associated data in S3.

Just out of curiosity, I created the two tables again and ran a different query this time. Below are the queries with their metrics.
select distinct(origin) from ontime_parquet_snappy;
Run time: 2.33 seconds, Data scanned: 4.76MB

select distinct(origin) from ontime;
Run time: 1.93 seconds, Data scanned: 2.14GB

As before, there is not much difference in the query execution time, but the amount of S3 data scanned for the Parquet Snappy data is significantly lower. So, the cost of running the query on the Parquet Snappy data is significantly less.
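To see why the scanned-data difference matters, here is a quick sketch of the cost of each query, assuming Athena's per-TB-scanned pricing (the $5/TB rate below is an assumption; check the AWS pricing page for the current figure):

```python
# Rough Athena cost estimate based on bytes scanned.
# Assumes the commonly quoted $5 per TB scanned; the actual rate
# (and any per-query minimum) should be confirmed on the pricing page.
PRICE_PER_TB = 5.00

def athena_cost(bytes_scanned):
    """Estimated cost in dollars for a query scanning `bytes_scanned` bytes."""
    return bytes_scanned / (1024 ** 4) * PRICE_PER_TB

csv_scanned = 2.14 * 1024 ** 3      # 2.14 GB scanned for the ontime (csv) table
parquet_scanned = 4.76 * 1024 ** 2  # 4.76 MB scanned for the Parquet Snappy table

print(f"csv query:     ${athena_cost(csv_scanned):.6f}")
print(f"parquet query: ${athena_cost(parquet_scanned):.6f}")
```

The same query costs orders of magnitude less on the columnar data, purely because far fewer bytes are scanned.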

Friday, June 23, 2017

Algorithmia - a store for algorithms, models and functions

I came across Algorithmia a few months back and didn't get a chance to try it out. It came into focus again with a Series A funding of $10.5M. More about the funding here.

Algorithmia is a place where algorithms, models and functions can be discovered and used for credits, which we can buy. We get 5,000 credits every month for free. For example, if a model costs 20 credits per call, it can be called 250 times a month.
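The credit arithmetic above can be checked quickly (the 20-credit cost is just the example figure from the paragraph):

```python
# Free monthly credit allowance divided by an example per-call cost.
FREE_CREDITS_PER_MONTH = 5000
COST_PER_CALL = 20  # hypothetical model costing 20 credits per call

calls_per_month = FREE_CREDITS_PER_MONTH // COST_PER_CALL
print(calls_per_month)  # 250 free calls a month for this model
```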

Create a free account here and get the API key from your profile. Now we can call the different models using languages like Python, Java and R, or commands like curl. Below are the curl commands to do sentiment analysis on a sentence. Make sure to replace API_KEY with your own key.

curl -X POST -d '{"sentence": "I really like this website called algorithmia"}' -H 'Content-Type: application/json' -H 'Authorization: Simple API_KEY'

{"result":[{"compound":0.4201,"negative":0,"neutral":0.642,"positive":0.358,"sentence":"I really like this website called algorithmia"}],"metadata":{"content_type":"json","duration":0.010212005}}

curl -X POST -d '{"sentence": "I really dont like this website called algorithmia"}' -H 'Content-Type: application/json' -H 'Authorization: Simple API_KEY'

{"result":[{"compound":-0.3374,"negative":0.285,"neutral":0.715,"positive":0,"sentence":"I really dont like this website called algorithmia"}],"metadata":{"content_type":"json","duration":0.009965723}}
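The JSON responses above are easy to consume programmatically. A minimal sketch in Python, using the first response shown above verbatim:

```python
import json

# Raw response exactly as returned by the first curl call above.
raw = ('{"result":[{"compound":0.4201,"negative":0,"neutral":0.642,'
       '"positive":0.358,"sentence":"I really like this website called algorithmia"}],'
       '"metadata":{"content_type":"json","duration":0.010212005}}')

response = json.loads(raw)
scores = response["result"][0]

# The compound score summarises the sentiment: above 0 is positive, below is negative.
label = "positive" if scores["compound"] > 0 else "negative"
print(scores["compound"], label)  # 0.4201 positive
```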
Algorithmia is much like the Google Play Store or the Apple App Store, where individuals and companies upload mobile applications and the rest of us download them. It's an attempt to democratize Artificial Intelligence and Machine Learning.

Here is a service to convert black and white images to color.

Monday, June 19, 2017

Converting Airline dataset from the row format to columnar format using AWS EMR

Processing Big Data requires a huge number of machines. Instead of buying them, it's better to process the data in the Cloud, as it provides lower CAPEX and OPEX costs. In this blog we will look at processing the Airline dataset in AWS EMR (Elastic MapReduce). EMR provides Big Data as a service. We don't need to worry about installing, configuring, patching or securing the Big Data software; EMR takes care of all that. We just need to specify the size and number of machines in the cluster, the location of the input/output data and finally the program to run. It's as easy as that.

The Airline dataset is in csv format, which is efficient for fetching data row-wise based on some condition, but not really efficient for aggregations. So, we will convert the csv data into Parquet format and then run the same queries on the csv and the Parquet data to observe the performance improvement.

Note that using AWS EMR will incur costs and doesn't fall under the AWS free tier, as we will be launching not t2.micro instances but somewhat bigger EC2 instances. I will try to keep the cost to a minimum, as this is a demo. Also, I prepared the required scripts ahead of time and tested them on the local machine on small datasets instead of on AWS EMR. This saves on AWS expenses.

So, here are the steps

Step 1 : Download the Airline dataset from here and uncompress it. All the datasets can be downloaded and uncompressed, but to keep the cost to a minimum I downloaded only the 1987, 1989, 1991, 1993 and 2007 data and uploaded it to S3 as shown below.

Step 2 : Create a folder called scripts, put the three scripts described below in it, and upload it to S3.

The '1-create-tables-move-data.sql' script creates the ontime and the ontime_parquet_snappy tables, maps the data to the tables and finally moves the data from the ontime table to the ontime_parquet_snappy table, transforming it from csv to Parquet format along the way. Below is the SQL for the same.
create external table ontime (
  Year INT,
  Month INT,
  DayofMonth INT,
  DayOfWeek INT,
  DepTime  INT,
  CRSDepTime INT,
  ArrTime INT,
  CRSArrTime INT,
  UniqueCarrier STRING,
  FlightNum INT,
  TailNum STRING,
  ActualElapsedTime INT,
  CRSElapsedTime INT,
  AirTime INT,
  ArrDelay INT,
  DepDelay INT,
  Origin STRING,
  Dest STRING,
  Distance INT,
  TaxiIn INT,
  TaxiOut INT,
  Cancelled INT,
  CancellationCode STRING,
  Diverted STRING,
  CarrierDelay INT,
  WeatherDelay INT,
  NASDelay INT,
  SecurityDelay INT,
  LateAircraftDelay INT
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LOCATION 's3://airline-dataset/airline-csv/';

create external table ontime_parquet_snappy (
  Year INT,
  Month INT,
  DayofMonth INT,
  DayOfWeek INT,
  DepTime  INT,
  CRSDepTime INT,
  ArrTime INT,
  CRSArrTime INT,
  UniqueCarrier STRING,
  FlightNum INT,
  TailNum STRING,
  ActualElapsedTime INT,
  CRSElapsedTime INT,
  AirTime INT,
  ArrDelay INT,
  DepDelay INT,
  Origin STRING,
  Dest STRING,
  Distance INT,
  TaxiIn INT,
  TaxiOut INT,
  Cancelled INT,
  CancellationCode STRING,
  Diverted STRING,
  CarrierDelay INT,
  WeatherDelay INT,
  NASDelay INT,
  SecurityDelay INT,
  LateAircraftDelay INT
) STORED AS PARQUET LOCATION 's3://airline-dataset/airline-parquet-snappy/' TBLPROPERTIES ("parquet.compression"="SNAPPY");

INSERT OVERWRITE TABLE ontime_parquet_snappy SELECT * FROM ontime;
The '2-run-queries-csv.sql' script runs the query on the ontime table, which maps to the csv data. Below is the query.
INSERT OVERWRITE DIRECTORY 's3://airline-dataset/csv-query-output' select Origin, count(*) from ontime where DepTime > CRSDepTime group by Origin;
The '3-run-queries-parquet.sql' script runs the query on the ontime_parquet_snappy table, which maps to the Parquet Snappy data. Below is the query.
INSERT OVERWRITE DIRECTORY 's3://airline-dataset/parquet-snappy-query-output' select Origin, count(*) from ontime_parquet_snappy where DepTime > CRSDepTime group by Origin;
Step 3 : Go to the EMR management console and click on 'Go to advanced options'.

Step 4 : Here, select the software to be installed on the instances. For this blog we need Hadoop 2.7.3 and Hive 2.1.1; make sure these are selected, the rest are optional. Here we can also add a Step. According to the AWS documentation, a Step is 'a unit of work that contains instructions to manipulate data for processing by software installed on the cluster'. This can be a MapReduce program, a Hive query, a Pig script or something else. Steps can be added here or later; we will add them later. Click on Next.

Step 5 : In this step, we can select the number of instances we want to run and the size of each instance. We will leave them as the defaults and click on Next.

Step 6 : In this step, we can select additional settings like the cluster name, the S3 log path and so on. Make sure the 'S3 folder' points to the log folder in S3, uncheck the 'Termination protection' option and click on Next.

Step 7 : In this screen, again, the default options are good enough. If we want to ssh into the EC2 instances, an 'EC2 key pair' has to be selected. Here are the instructions on how to create a key pair. Finally, click on 'Create cluster' to launch the cluster.

Initially the cluster will be in a Starting state and the EC2 instances will be launched as shown below.

Within a few minutes, the cluster will be in a running state and the Steps (data processing programs) can be added to the cluster.

Step 8 : Add a Step to the cluster by clicking on 'Add step', pointing it to the '1-create-tables-move-data.sql' file as shown below and clicking on Add. The processing will start on the cluster.

The Step will be in a Pending status for some time and then move to the Completed status after the processing has been done.

Once the processing is complete, the csv data will have been converted into Parquet format with Snappy compression and put into S3 as shown below.

Note that the csv data was close to 2,192 MB, while the Parquet Snappy data is around 190 MB. Parquet is a columnar format and provides higher compression than csv. This makes it possible to fit more data into memory, which leads to quicker results.
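The size figures above work out to better than an 11x reduction:

```python
# Compression factor from the sizes reported above.
csv_mb = 2192      # size of the csv data in MB
parquet_mb = 190   # size of the Parquet Snappy data in MB

ratio = csv_mb / parquet_mb
print(f"Parquet Snappy is about {ratio:.1f}x smaller than csv")  # about 11.5x
```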

Step 9 : Now add 2 more Steps using the '2-run-queries-csv.sql' and '3-run-queries-parquet.sql' scripts. The first runs the query on the csv data table and the second runs the same query on the Parquet Snappy table; both return the same results in S3.

Step 10 : Check the step log files for the steps to get the execution times in the EMR management console.

Converting the CSV to Parquet Snappy format - 148 seconds
Executing the query on the csv data - 90 seconds
Executing the query on the Parquet Snappy data - 56 seconds

Note that the query runs faster on the Parquet Snappy data than on the csv data. I was expecting it to run a bit faster still; I need to look into this a bit more.
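From the step logs above, the relative speedup on the Parquet Snappy data works out as follows:

```python
# Query execution times from the EMR step logs, in seconds.
csv_seconds = 90       # query on the csv table
parquet_seconds = 56   # same query on the Parquet Snappy table

speedup = csv_seconds / parquet_seconds
print(f"About {speedup:.2f}x faster on Parquet Snappy")  # about 1.61x
```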

Step 11 : Now that the processing has been done, it's time to terminate the cluster. Click on Terminate and again on Terminate. It will take a few minutes for the cluster to terminate.

Note that the EMR cluster will be terminated and the EC2 instances will also be terminated.

Step 12 : Go back to the S3 management console; the below folders should be there. Clean up by deleting the bucket. I will be keeping the data, though, so that I can try Athena and Redshift on the csv and the Parquet Snappy data. Note that 5 GB of S3 data can be stored for free for up to one year. More details about the AWS free tier here.

In future blogs, we will look at processing the same data using AWS Athena. With Athena there is no need to spawn a cluster; it follows the serverless model, with AWS provisioning the servers automatically. We simply create a table, map it to the data in S3 and run SQL queries on it.

With EMR, pricing is rounded up to the hour, so for a job that runs about 1 hour and 5 minutes we pay for a full 2 hours. With Athena we pay by the amount of data scanned. So, converting the data into a columnar format and compressing it not only makes the query run a bit faster, but also cuts down the bill.
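The two billing models in the paragraph above can be sketched as follows (the $5-per-TB Athena rate is an assumption; check the AWS pricing page for the current figure):

```python
import math

# EMR (2017 model): billed per instance-hour, rounded up.
runtime_minutes = 65  # about 1 hour and 5 minutes
billed_hours = math.ceil(runtime_minutes / 60)
print(billed_hours)  # 2 full hours billed for a 65-minute job

# Athena: billed per TB scanned, regardless of wall-clock time.
price_per_tb = 5.00   # assumed rate
gb_scanned = 2.14     # the csv query from the earlier Athena post
print(f"${gb_scanned / 1024 * price_per_tb:.4f}")
```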

Update : Here and here are articles from the AWS documentation on the same topic. They include some additional commands.

Friday, June 16, 2017

GCP for AWS Professionals

As many of you know, I have been working with AWS (the public Cloud from Amazon) for quite some time, so I thought of getting my feet wet with GCP (the public Cloud from Google). I tried to find some free MOOCs around GCP and didn't find many.

Google partnered with Coursera and started a MOOC on the GCP fundamentals. It covers the different GCP fundamentals at a very high level with a few demos. Here is the link for the same. There is also documentation from Google comparing the GCP and the AWS platform here.

As I was going through the above-mentioned MOOC, I could clearly map many of the GCP services to the AWS services I am more comfortable with. Here is the mapping between the different GCP and AWS services. The mappings are really helpful for those who are comfortable with one Cloud platform and want to get familiar with the other.

AWS provides free resources for developers to get more familiar with and get started on their platform. The same is the case with GCP: a $300 credit, valid for 1 year, is provided. Both Cloud vendors provide a few services free for the first year, and some services are free for life with some usage limitations.

Maybe it's my perspective, but the content around AWS seems to be much more robust and organized compared to the GCP documentation. The same is the case with the Web UI. Anyway, here is the documentation for AWS and here is the documentation for GCP.

Happy clouding.

AWS Lambda with Serverless framework

Google Cloud Functions, IBM OpenWhisk, Azure Functions and AWS Lambda allow building applications in a Serverless fashion. Serverless doesn't mean that there is no server; rather, the Cloud vendor takes care of provisioning the servers and scaling them up and down as required. In the Serverless model, all we have to do is author a function, along with the event that triggers it and the resources it uses.

Since no servers are being provisioned, there is no constant cost. The cost is directly proportional to the number of times the function is called and the amount of resources it consumes. This model is useful for a startup with limited resources or for a big company that wants to deploy applications quickly.

Here is a nice video from the founder of aCloudGuru on how they built their entire business on a Serverless Architecture. Here is a nice article/tutorial from AWS on using a Lambda function to shrink images uploaded to an S3 bucket.

In the above workflow, as soon as an image is uploaded to the S3 Source Bucket, it fires an event that calls the Lambda function. The function shrinks the image and then puts it in the S3 Target Bucket. In this scenario, there is no sustained cost, as we pay based on the number of times the function is called and the amount of resources it consumes.

For the last few days I have been intrigued by the Serverless architecture and have been trying the use cases mentioned in the AWS Lambda documentation here. It was fun, but not straightforward or as simple as uploading a function. The function has to be authored, packaged and uploaded; permissions have to be set for the events/resources; the Lambda has to be tested and finally integrated with the API Gateway. Oops. It's not an impossible task, but definitely tedious. And I haven't even mentioned debugging Lambda functions, which is a real pain.

To the rescue comes the Serverless framework, which makes working with Lambda functions easy. Setting up the Serverless framework on Ubuntu was a breeze. Creating a HelloWorld Lambda function in AWS with all the required dependencies was even easier.

Note that the Serverless framework supports other platforms besides AWS, but in this blog I will provide some screenshots with a brief write-up on AWS. Here, I go with the assumption that the Serverless framework and all its dependencies, like NodeJS and integration with the AWS Security Credentials, have been set up.

So, here are the steps.

Step 1 : Create the skeleton code and the Serverless Lambda configuration file.

Step 2: Deploy the code to AWS using the CloudFormation.

Once the deployment is complete, the AWS resources will be created as shown below in the CloudFormation console.

As seen in the below screen, the Lambda function also gets deployed.

Step 3 : Now it is time to invoke the Lambda function, again using the Serverless framework, as shown below.

In the Lambda management console, we can observe that the Lambda function was called once.

Step 4 : Undeploy the Lambda function using Serverless framework.

The CloudFormation stack should be removed and all the associated resources should be deleted.

This completes the steps for a simple demo of the Serverless framework. One thing to note: we simply uploaded a Python function to the AWS Cloud and never created a server, hence the name Serverless Architecture. Here are a few more resources to get started with the Serverless Architecture.

On a side note, I was really surprised that the Serverless framework was started by CocaCola, which at some point decided to open source it on Github. It was always companies like Facebook, Google, Twitter, LinkedIn, Netflix and Microsoft :) opening their internal software to the public; this was the first time I saw something like this from CocaCola.

Maybe I should drink more CocaCola and promote them to publish such cool frameworks.

Wednesday, May 31, 2017

New AWS Certifications on Big Data and Advanced Networking

AWS announced the AWS Certified Big Data – Specialty and the AWS Certified Advanced Networking – Specialty certifications. These certifications had been in Beta and were opened to the public today. The exam guide and the sample questions can be downloaded from the certification home pages mentioned earlier. One of the associate-level certifications (1, 2, 3) has to be cleared before taking these certifications.

There are two online courses to help with getting through the AWS Big Data certification (here and here). I am not exactly sure how much these courses cover the various topics of the Big Data certification, but it never hurts to take one of them.

I am comfortable with Big Data and the AWS Cloud, so I am planning to attempt the `AWS Certified Big Data – Specialty` ASAP. I will also post an update with tips and my experience with the certification. I am really excited about it.

Best of luck !!!

Friday, May 19, 2017

AWS and Big Data Training

I have been in the training line (along with consulting and projects) for a few years, both online and in the classroom, using a Wacom Tablet which I bought 5 years back. The Tablet helps me convey the message better and more quickly, whether the participant is in some other part of the world or in front of me. I am really passionate about the trainings, as they help me think about a particular concept from different perspectives.

So, here is an AWS demo on how to create an ELB (Elastic Load Balancer) and EC2 (Linux server) instances in the Cloud. This particular combination can be used for High Availability or for Scaling Up as more and more users start using your application.

Similarly, below is a Big Data demo on creating an HBase Observer, which is very similar to a Trigger in an RDBMS. BTW, for those who are new to HBase, it's a columnar NoSQL database. There are a lot of NoSQL databases, and HBase is one of them.

If you are interested in any of these trainings, please contact me for further details. Both Big Data and Cloud are all the rage, so don't miss the opportunities around them.

See you soon !!!

Wednesday, May 3, 2017

Attaching an EBS Disk to a Linux Instance

In the previous blog, we looked at the sequence of steps to create a Linux instance and log into it. In this blog, we will create a new hard disk (actually an EBS volume) and attach it to the Linux instance. I am going with the assumption that the EC2 instance has already been created.

1. Go to the EC2 management console and click on Volumes in the left pane. Then click on `Create Volume`, change the volume size to 1 GB and click on Create.

2. It takes a couple of seconds, but then there will be two EBS volumes: an 8 GB volume (in-use) which was automatically created at the time of EC2 creation, and a 1 GB volume (available) which was created in the above step.

3. Select the 1 GB volume, click on Actions and then Attach Volume. We have created a Linux EC2 instance and an EBS volume; now we need to attach the two.

4. Move the cursor to the instance box, select the EC2 instance to which the EBS volume has to be attached and then click Attach. The state of the 1 GB volume should change from available to in-use.

5. Get the IP address of the EC2 instance and log into it using Putty as mentioned here.

6. Change to root (sudo su) and get the list of partition tables (fdisk -l). The commands are mentioned in the parentheses. Note that the device name is /dev/xvdf in the output below. It may vary, so the previously mentioned command has to be run.

7. Build the Linux filesystem (mkfs /dev/xvdf), create a folder (mkdir /mnt/disk100) and finally mount the filesystem (mount /dev/xvdf /mnt/disk100). The commands are mentioned in the parentheses. Note that I have chosen to create the disk100 folder; you can replace it with any other folder name.

Now that the device has been mounted on the /mnt/disk100 folder, data written to this folder will go to the 1 GB EBS volume we created in one of the previous steps. Even after stopping the EC2 instance, the data will still be there on the EBS volume. The volume can also be attached to another EC2 instance.

Note that in AWS, an EBS volume cannot be attached to multiple EC2 instances at the same time. But the same thing can be done on the Google Cloud Platform.

Don't forget to terminate the Linux EC2 instance and delete the EBS volume. In the next blog, we will attach an EBS volume to a Windows EC2 instance, which is a bit easier.

Wednesday, April 26, 2017

Scratch programming for kids

It's summer holidays and I have been helping my kid (8 years old) get started with programming. So I bought the Computer Coding for Kids book by Carol Vorderman. The book covers Scratch (from MIT) first and then Python, starting from the basics. Scratch is a visual programming language which is easy for those who are just getting started with programming: no need to type code, it's more drag and drop.

So, here is his first program (link here), a small and cute game in Scratch. Click on the flag at the top right to get the game started. He was all excited to get it published on this blog. One thing to note is that running the game requires Adobe Flash, which is slowly going out of fashion with the different browsers because of its vulnerabilities and stability issues. For some reason, I was able to get it to run only in IE and not in the Edge/Chrome/Firefox browsers.

Scratch looks a bit basic, but I have seen some interesting programs developed in it. I would recommend the above-mentioned book for anyone who wants to get started with programming. Carol has written a few more books, which I plan to buy once we complete this one.

Monday, April 24, 2017

Creating a Linux AMI from an EC2 instance

In one of the earlier blogs, we created a Linux EC2 instance with a static page. We installed and started the httpd server and created a simple index.html in the /var/www/html folder. The application on the EC2 instance is very basic, but we can definitely build complex applications the same way.

Let's say we need 10 such instances. It's not necessary to install the software and make the configuration changes multiple times. What we can do is create an AMI once the software has been installed and the necessary configuration changes have been made. The AMI will have the OS and the required software with the appropriate configuration. The same AMI can then be used while launching new EC2 instances. More about AMIs here.

So, here are the steps

1. Create a Linux EC2 instance with the httpd server and index.html as shown here.

2. Once the different software packages have been installed and configured, the AMI can be created as shown below.

3. Give the image name and description and click on `Create Image`.

4. It takes a couple of minutes to create the AMI. Its status can be seen by clicking on the AMIs link in the left pane. Initially the AMI will be in a pending state; after a few minutes it will change to the available state.

5. When launching a new EC2 instance, we can select the new AMI created in the above steps from the `My AMIs` tab. The AMI has all the software and configuration in it, so there is no need to repeat the same thing again.

6. As we are simply learning/trying things and not running anything in production, make sure to a) terminate all the EC2 instances, b) deregister the AMI and c) delete the snapshot.

Now that we know how to create an AMI with all the required software/configuration/data, we will look at Auto Scaling and ELB (Elastic Load Balancers) in the upcoming blogs.