Friday, July 6, 2018

What is DIGITAL?

Very often we hear the word DIGITAL in the quarterly results of the different IT companies especially in India. The revenue from the DIGITAL business is compared with the traditional business. So, what is DIGITAL? There is no formal definition of DIGITAL, but has been loosely used by different companies as mentioned lately.

But, here is the definition of DIGITAL in an interview at MoneyControl (here) by Rostow Ravanan, Mindtree CEO and MD. This is a bit vague, but the best I could get till now. The vagueness comes from the fact that it doesn't say what BETTER is. Does anyone see something missing? I see IOT missing. Lately I had been working on IOT and would be writing my opinion on where IOT stands as of now.

Q: Digital is still a vague term in the minds of many. What does it mean for you?

A: So let me go back a little bit and tell you what we define as digital. We define digital and we put that in our factsheet, whenever we declare results every quarter.

In our definition of digital, we take one or two ways of defining it. To a business user, we definite digital from a business process perspective to say anything that allows my customer to connect to their customer better or anything that allows my customer to connect to their people better is one way of defining digital from a business process point of view.

Or if you were to look at digital from a technology definition point of view, we say it is social, mobility, analytics, cloud, and e-commerce. From a technology point of view, that is how we define digital.

Thursday, November 2, 2017

Refining the existing AWS Security Groups

I am a big fan of blogs/articles which use multiple services from AWS. Each of the services from AWS is powerful, but when we combine them in different ways we can achieve a lot more.

When an organization deploys an application in the Cloud, over time there can be some port numbers in the Security Groups which are not required for the functionality of the application. These unnecessary ports might be a security risk to the organization. So, it's always better to open the minimum set of port numbers required.

AWS doesn't give a direct way to identify the unused ports, the VPC flow logs have to be captured and analyzed to identify the unused port numbers and the corresponding Network Interfaces and Security Groups. So, below are two blogs from AWS on the same.

How to Optimize and Visualize Your Security Groups

How to Visualize and Refine Your Network’s Security by Adding Security Group IDs to Your VPC Flow Logs

The end results are the same for the two blogs, but they do use different services from AWS. The blogs are pretty straightforward to follow. After trying it out, you would be familiar with VPC flow logs, Lambda, Kinesis, Elasticsearch and IAM.

For those who are getting started with AWS, I would definitely recommend going through the above blogs in the same order. Depending on the technology comfort, it might take time, but the blogs are worth trying it out.

Tuesday, October 31, 2017

Managing the Raspberry Pi from Laptop

Pi (short for Raspberry Pi) is a single-board computer (SBC). It has a couple of USB ports to which keyboard, mouse and other peripherals can be connected. An HDMI port to connect a monitor, a MicroSD slot for OS/applications/data and a mini USB port for power supply. There are a bunch of models in Pi and I have bought Raspberry Pi 2 Model B a few years back and plugged the different components together as shown below. Finally, installed Raspbian on it. Note that the green casing is not part of the Pi, it had to be ordered additionally.

The right side cables are to the USB ports to which a mouse and keyboard are connected. On top of the USB cable is the WiFi dongle. The later models of Pi have inbuilt support for WiFi, but this model doesn't have. The left side cables are HDMI and power supply. It's a cool setup for the kids to get started with computers. It's easy to setup, but those who are scared there are a few choices of laptops built on pi as the new pi-top. The Pi has a few GPIO ports to which different sensors (light, temperature, humidity etc) can be connected to get the ambiance conditions and take some actions using actuators.

The above configuration is cool, but is not mobile as it's much like a desktop. So, I was looking for options to connect it to the Laptop and use the keyboard, mouse, monitor and power from the laptop. Here I got the instructions for the same. Finally. I was able to manage the Pi from the laptop as shown below. The Pi Desktop is there on the Laptop. Not sure why, but the VNC server on the Pi stopped starting automatically. So, I had to login via ssh to the Pi and start the VNC server.

Thursday, October 26, 2017

Installing WordPress on AWS Cloud

In this blog, we would be installing WordPress which is a popular CMS (Content Management System) on EC2 and RDS. The WordPress software is widely used to create blogs and web sites. This configuration is not recommended for production. At the end of the blog, the additional tasks to be done to make the entire thing a bit more robust would be mentioned. The end product is as below.

The WordPress software would be running on a Ubuntu EC2 instances and the data would be stored in MySQL RDS instance. Starting the RDS takes time, so first we would start the RDS instance and then the EC2 instance with WordPress on it.

Tuesday, October 24, 2017

Amazon Macie and S3 Security

AWS S3 got into limelight lately for wrong reasons, more here (1, 2, 3). S3 security policies are a pain in the neck to understand, we will cover about security in the context of S3 in a detailed blog later. Before the cloud was there it took a few days to weeks for procuring the hardware, software, setting them etc. But with the cloud, it takes a few minutes to create an S3 bucket, put some sensitive data and finally set some wrong permissions on them.

Meanwhile, AWS launched Macie to protect sensitive data from getting into the wrong hands. Here is the blog from AWS launching Macie and here the documentation on how to get started with Macie. The blog explains nicely on how to get started with Macie. Also, look at the Macy FAQ here. Initially, Macie covers only S3 data, the plan is to roll Macie for other services like EC2.

Wednesday, October 18, 2017

Getting notified for any state change in the AWS resources (like EC2 going down)

EC2 can be used for any purpose like running a website, doing some batch processing. A website has a requirement to run 99.9 or 99.99 or some other percentage of the time, so back up of the EC2 instances are required for the sake of high availability. But, lets take the case of batch as in the case of transforming 1000's of records from one format to another, then high availability is not really important. If an instance fails then another can be started or the work can be automatically shifted to some other instance automatically as in the case of Big Data.

Just to quickly summarize, in the case as in the case of web servers we need some level of high availability and so multiple EC2 instances (backup), but in the case of batch processing there is no need of backup. Lets take the case of a batch job running on a single EC2 instance, it would be good to get some sort of notification when the instance goes down. We would be looking into the same in this blog. We would be using EC2, SNS and CloudWatch. So, it would be a good way to get familiar with the different topics.

So, here are the steps.

Step 1: Create a new topic in the SNS management console.

Tuesday, October 17, 2017

Microsoft Azure for AWS Professionals

A few months back I blogged about 'GCP for AWS Professionals' comparing the two platforms here. Now, Microsoft has published something similar comparing Microsoft Azure with Amazon AWS here.

It's good to know for Amazon AWS when their competitors are comparing their services with Amazon's. AWS has been adding new services and features (small and big) within them at a very rapid pace. Here you can get the new features introduced in Amazon AWS on a daily basis.

Similar to GCP and AWS, Azure also gives free credit to get started. So, now is the time to create an account and get started with Cloud. So, here are the links for the same (AWS, Azure and GCP).

Getting the execution times of the EMR Steps from the CLI

In the previous blog, we executed a Hive script to convert the Airline dataset from the original csv format to Parquet Snappy format. And then same query were run to csv and the Parquet Snappy format data to see the performance improvements. This involved three steps.

Step 1 : Create the ontime and the ontime_parquet_snappy table. Move the data from ontime table to the ontime_parquet_snappy table for the conversion of one format to another.

Step 2 : Execute the query on the ontime table, which represents the csv data.

Step 3 : Execute the query on the ontime_parquet_snappy time, which representa the Parquet Snappy data.

The execution time for the above three steps was got from the AWS EMR management console which is a Web UI. All the tasks which can be done from the AWS management console can also be done from the CLI (Command Line Interface) also. Lets see the steps involved to get the execution time for the steps in EMR.

Different ways of executing the Big Data processing jobs in EMR

There are different ways of kick starting a Hive/Pig/MR/Spark on Amazon EMR. We already looked at how to submit a Hive job or a step from the AWS EMR management console here. This approach is cool, but doesn't have much scope for automation.

Here are the other ways to start the Big Data Processing with some level of automation.

1) Use Apache Oozie to create a workflow and a coordinator.
2) Use the AWS CLI
3) Login to the master instance and use the Hive shell

In the above, Option 1 is a bit complicated and will be explored in another blog. Here we will be looking at the other two options.

Option 2 : Using the AWS CLI

Step 1 : Create the airline.sql with the below content. The below will create a table in Hive and map it to the data in S3. To get the data into S3 follow this article. Then a query will be run on the table.
create external table ontime_parquet_snappy (
  Year INT,
  Month INT,
  DayofMonth INT,
  DayOfWeek INT,
  DepTime  INT,
  CRSDepTime INT,
  ArrTime INT,
  CRSArrTime INT,
  UniqueCarrier STRING,
  FlightNum INT,
  TailNum STRING,
  ActualElapsedTime INT,
  CRSElapsedTime INT,
  AirTime INT,
  ArrDelay INT,
  DepDelay INT,
  Origin STRING,
  Dest STRING,
  Distance INT,
  TaxiIn INT,
  TaxiOut INT,
  Cancelled INT,
  CancellationCode STRING,
  Diverted STRING,
  CarrierDelay INT,
  WeatherDelay INT,
  NASDelay INT,
  SecurityDelay INT,
  LateAircraftDelay INT
) STORED AS PARQUET LOCATION 's3://airline-dataset/airline-parquet-snappy/' TBLPROPERTIES ("orc.compress"="SNAPPY");

INSERT OVERWRITE DIRECTORY 's3://airline-dataset/parquet-snappy-query-output' select Origin, count(*) from ontime_parquet_snappy where DepTime > CRSDepTime group by Origin; 

EMR logging into the master instance

Once we spawn a Cluster as mentioned here, we should see the instances in the EC2 management console. It would be nice to login to the master instance. All the log files are generated on the master and then moved to S3. Also, the different Big Data processing jobs can be run from the master command line interface.

In this blog we will look into connecting to the master. The AWS documentation for the same is here.

Step 1 : Click on the gear button on the top right. The columns in the page can be added or deleted here.

Include the EMR related keys as shown in the right of the above screen shot and the EC2 instance roles (MASTER and CORE) will be displayed as shown below.

Get the DNS hostname of the master instance after select it.