Wednesday, October 14, 2020

Applications around the intersection of Big Data / Machine Learning and AWS

As many readers of this blog know, I am a big fan of Big Data and the AWS Cloud, and I am especially interested in the intersection of the two. Big Data processing requires a huge number of machines to process huge amounts of data and to do complex processing, as in the case of Machine Learning.

The Cloud has democratized the usage of Big Data. There is no need to buy any machines: we can spin up a number of EC2 instances, do the Big Data processing and terminate the instances once done. AWS and other vendors are doing a lot of hardware and software innovation in this space; below are a few hardware innovations from AWS. They require a lot of investment in R&D and in building them, which is usually possible only at the scale at which the Cloud operates.

AWS Nitro System : Some of the virtualization responsibilities have been shifted from the CPU to dedicated hardware and software.

AWS Graviton Processor : The Graviton processor uses an ARM based architecture, similar to the ones used in mobile phones. Now we can spin up EC2 instances with the Graviton Processor.

AWS and Nvidia : They bring very high end GPUs to the Cloud through EC2 instances for Machine Learning modelling.

AWS Inferentia : Once the Machine Learning model has been created, the next step is inference, which over the life of a model often takes most of the compute cycles. Inferentia is a custom chip from AWS built for inference.

F1 Instances : Hardware acceleration on EC2 using FPGAs.

Coming back to the subject of this blog, AWS provides a few open data sets via S3 for free, so that we can process them in the Cloud and get some meaningful insights out of them. The data sets can be found here. For those who are familiar with either AWS or Big Data, the challenge is figuring out how the two work together. For this, AWS has published a number of blogs/articles here on the intersection of AWS and Big Data / Machine Learning for different domains. Below is a sample application at the intersection of Big Data and AWS around Genome data. Note that AWS has been highlighted; look out for more of them.


The intersection of Big Data / Machine Learning and AWS is very interesting. The Cloud, with its pricing, democratizes the usage of Big Data / Machine Learning, but each one is a beast of its own to learn, and there is so much innovation happening in this space that it's tough to keep pace. Here are a few applications around these to get started. Good Luck !!!

Thursday, October 8, 2020

Setting up additional EC2 users with username/password and Keypair authentication

When an Ubuntu EC2 instance is created in the AWS Cloud, it can in principle be accessed using either username/password or a Keypair. In the case of the Ubuntu AMI provided by AWS, only Keypair authentication is enabled, while username/password authentication is disabled. Very often I get the query "How to create additional users for the Ubuntu EC2 with Keypair for authentication", hence this blog. At the end of the day, Linux is Linux whether we run it in the Cloud, on a Laptop or On-Premise, so the instructions apply everywhere.

Setting up an EC2 user with username/password authentication

Step 1: Create an Ubuntu EC2 instance and connect to it

Step 2: Add user "praveen" using the below command
#Enter the password and other details
sudo adduser praveen

Step 3: Open the "/etc/ssh/sshd_config" file and set "PasswordAuthentication" to yes

Step 4: Restart the ssh service
sudo service ssh restart

Step 5: Connect to the EC2 as the user "praveen" via Putty or some other software by specifying the password
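
Step 3 above can also be scripted instead of editing the file by hand. The following is a minimal sketch, not part of the original steps: the function name enable_password_auth and its optional file argument are made up for illustration, and it assumes GNU sed (as shipped with Ubuntu).

```shell
#!/bin/sh
# Sketch: set "PasswordAuthentication yes" in an sshd_config-style file.
# enable_password_auth and its argument are illustrative names.
enable_password_auth() {
    config_file="${1:-/etc/ssh/sshd_config}"
    if grep -Eq '^[#[:space:]]*PasswordAuthentication[[:space:]]' "$config_file"; then
        # Rewrite the existing (possibly commented-out) directive in place.
        sed -i -E 's/^[#[:space:]]*PasswordAuthentication[[:space:]].*/PasswordAuthentication yes/' "$config_file"
    else
        # No directive present at all: append one.
        echo 'PasswordAuthentication yes' >> "$config_file"
    fi
}
```

On the instance it would have to run with root privileges against /etc/ssh/sshd_config, followed by the service restart from Step 4. Trying it first on a copy of the file is a safe way to check the result.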

Setting up an EC2 user with Keypair authentication

Step 1: Add user "sripati" and disable password authentication
#as we would be using the Keypair for authentication
sudo adduser sripati --disabled-password

Step 2: Switch to the user
sudo su - sripati

Step 3: Generate the keys. They would be in the .ssh folder

Step 4: Copy the public key to the authorized_keys file in the .ssh folder
cat .ssh/ >> .ssh/authorized_keys

Step 5: Copy the private key in ~/.ssh/id_rsa to a file named sripati.pem on your local machine
cat ~/.ssh/id_rsa

Step 6: Using PuttyGen convert the pem file to ppk. "Load" the pem file and "Save private key" in the ppk format.

Step 7: Now connect via Putty with the username "sripati", the public IP of the EC2 instance and the private key in the ppk format. There is no need to specify a password.
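
Steps 3 and 4 above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name install_public_key is made up, and the key file name id_rsa.pub is assumed from the default RSA key name referenced in Step 5. The helper also sets the directory and file permissions that sshd normally expects.

```shell
#!/bin/sh
# Sketch: append a public key to authorized_keys with strict permissions.
# install_public_key and its arguments are illustrative names.
install_public_key() {
    public_key_file="$1"           # e.g. ~/.ssh/id_rsa.pub (assumed name)
    ssh_dir="${2:-$HOME/.ssh}"     # defaults to the current user's .ssh folder
    mkdir -p "$ssh_dir"
    chmod 700 "$ssh_dir"
    cat "$public_key_file" >> "$ssh_dir/authorized_keys"
    chmod 600 "$ssh_dir/authorized_keys"
}

# Typical use as the new user (not run here):
#   ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa -N ""
#   install_public_key ~/.ssh/id_rsa.pub
```

The ssh-keygen line in the comment generates an RSA key pair without a passphrase; the key size is a choice made for this sketch, not something specified in the steps.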

Tuesday, October 6, 2020

Provisioning AWS infrastructure using Ansible

Cloud infrastructure provisioning can be automated using code. The main advantage is that the process can be repeated with consistent output, and the code can be version controlled in GitHub, Bitbucket or something else.

AWS comes with CloudFormation for automating the provisioning of AWS infrastructure. The main disadvantage is that a CloudFormation template (code) is very specific to AWS and takes a lot of effort to migrate to some other Cloud. In this blog we will look at Ansible, with which infrastructure can be provisioned on multiple Clouds; migrating the provisioning code to some other Cloud also doesn't take as much effort as with CloudFormation.

We will be installing Ansible on an Ubuntu EC2 instance for provisioning the AWS infrastructure. Ansible can be set up on Windows also, but as we install more and more software directly on Windows (the host OS), it becomes slow over time. So, I prefer to launch an EC2 instance, try a few things and tear it down once done. Anyway, let's look at setting up Ansible and creating AWS infrastructure with it.

Step 1: Create an Ubuntu instance (t2.micro) and connect to it.

Step 2: Install Python and boto (AWS SDK for Python) on the EC2 instance using the below commands.

   sudo apt-get update
   sudo apt-get install python2.7 python-pip -y
   pip install boto

Step 3: Install Ansible using the below command.

   sudo apt install software-properties-common -y
   sudo apt-add-repository --yes --update ppa:ansible/ansible
   sudo apt install ansible -y

Step 4: Go to the IAM Management Console here (1) and create the Access Keys. Note them down.

Step 5: Export the Access Keys using the below commands. Make sure to replace 'ABC' and 'DEF' with the Access Key ID and Secret Access Key generated in the previous step.

   export AWS_ACCESS_KEY_ID='ABC'
   export AWS_SECRET_ACCESS_KEY='DEF'

Step 6: Create a file called "launch-ec2.yaml" with the below content. Make sure to replace the highlighted sections.

- name: Provision a set of instances
  hosts: localhost
  tasks:
    - name: Provision a set of instances
      ec2:
        key_name: my-keypair
        region: us-east-1
        group_id:
          - sg-0fa7df1dab4d7ebcb
          - sg-040f6c6ef9932dbb5
        instance_type: t2.micro
        image: ami-0bcc094591f354be2
        wait: yes
        instance_tags:
          Name: Demo
        exact_count: 1
        count_tag: Name
        assign_public_ip: yes
        vpc_subnet_id: subnet-59120577

Step 7: Execute the below command to launch an EC2 instance.

ansible-playbook launch-ec2.yaml

Step 8: Go to the EC2 Management Console and notice a new EC2 instance has been launched with the Name:Demo tag. Make sure to note down the "Instance ID" of the newly created EC2 instance.

Step 9: Create a file called "terminate-ec2.yaml" with the below content. Make sure to replace the highlighted section with the Instance ID of the EC2 instance noted in the previous step.

- name: Terminate instances
  hosts: localhost
  tasks:
    - name: Terminate instances
      ec2:
        state: "absent"
        instance_ids: "i-08ef0942aabbc45d7"
        region: us-east-1
        wait: true

Step 10: Execute the below command to terminate the EC2 instance.

ansible-playbook terminate-ec2.yaml

Step 11: Go back to the EC2 Management Console and notice that the EC2 which was created by Ansible will be in a terminated status within a few minutes.
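
The playbooks above rely on boto (installed in Step 2), which can read credentials from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. A small guard like the one below can catch a forgotten export before the playbook runs. It is only a sketch; require_aws_credentials is a hypothetical helper name, not part of Ansible or boto.

```shell
#!/bin/sh
# Sketch: fail early if the AWS credentials have not been exported.
# require_aws_credentials is a hypothetical helper name.
require_aws_credentials() {
    if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
        echo "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be exported" >&2
        return 1
    fi
}

# Usage (not run here):
#   require_aws_credentials && ansible-playbook launch-ec2.yaml
```

The check only verifies that both variables are non-empty; it does not validate the keys against AWS.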


By using YAML code, we were able to launch and terminate an instance. Ansible can do far more complicated things than this; this is just something to start with. As mentioned earlier, Ansible allows easier migration to some other Cloud vendor when compared to AWS CloudFormation. BTW, Ansible has been bought by Red Hat, which has been bought by IBM. So, Ansible is part of IBM now.

For reference, here is the YAML code for launching and terminating the EC2 instances; the screen has been split horizontally using tmux.

Thursday, October 1, 2020

Automating EC2 or Linux tasks using "tmux"

A lot of times we create multiple EC2 instances and install the same software on each one of them manually. This can be for trying out a Load Balancer feature or for testing routing with High Availability across different Regions and Availability Zones. One way to avoid this manual process is to create an AMI, but AMIs are immutable and a new AMI has to be created for even small changes. This is where tmux (Terminal Multiplexer) comes into play.

Here the assumption is that we want three EC2 instances as shown above, fronted by an ELB which will load balance the traffic across them. On each of these instances we would like to install Apache2 and create webpages. For this, we will use one of the EC2 instances as the jump or bastion box and connect to the other two EC2 instances from it, as shown below.

Step 1: Start three EC2 Ubuntu instances and name them as "WS1/Jump/BastionBox", "WS2" and "WS3".

Step 2: Download pageant.exe from here (1), click on "Add Key" and point to the Private Key in the ppk format. Close the window.

Step 3: Connect to the EC2 instance named "WS1/Jump/BastionBox" via Putty. In the "Host Name (or IP address)" field specify the username and the IP as shown below.

Go to "Connection --> SSH --> Auth" and make sure to select "Allow agent forwarding". This makes it easy to connect to the EC2 instances, as there is no need to specify the Private Key; it will be picked up from pageant.exe. Click on "Open" to connect to the EC2 instance.

Step 4: Execute the tmux command to start it.

Step 5: Enter "Ctrl + B" and "%" to split the panes horizontally. Again enter "Ctrl + B" and "Double Quotes" to split the panes vertically. Now we should see three panes as shown below. Use "Ctrl + B" and the arrow keys to navigate between the panes.

Step 6: On the upper and lower panes on the right side, execute the "ssh ubuntu@ip" command to log in to the EC2 instances. Make sure to replace "ip" in the command with the IP addresses of the WS2 and WS3 EC2 instances.

Step 7: Now we are connected to three EC2 instances as shown below. Execute the "ifconfig" command on all the panes and note that the IP address should be different. This is to make sure we are connected to different EC2 instances.

Step 8: Now we will turn on synchronization across the panes, so that any command executed on one pane is automatically executed on the other panes as well. To turn it on, enter "Ctrl + B" and ":", then type "setw synchronize-panes on" and press "Enter". Use the setw command with the "off" option to turn off the synchronization across the panes.

Step 9: Navigate to one of the panes and notice that any command executed in it also gets executed in the other panes. Ain't it neat !!!
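
The interactive key strokes in Steps 4 to 8 can also be driven from a script, since every tmux key binding has a command equivalent. The following is a rough sketch, not something from the steps above: the function name start_synced_session, the TMUX_BIN variable and the session name are made up for illustration. Setting TMUX_BIN to "echo" prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: build the three-pane layout and turn on synchronized input.
# TMUX_BIN and start_synced_session are illustrative names.
TMUX_BIN="${TMUX_BIN:-tmux}"   # override with "echo" for a dry run

start_synced_session() {
    session="$1"; ws2_ip="$2"; ws3_ip="$3"
    # Detached session; its first pane stays on the jump box (WS1).
    "$TMUX_BIN" new-session -d -s "$session"
    # Same as "Ctrl + B" and "%": the new right-hand pane becomes active.
    "$TMUX_BIN" split-window -h -t "$session"
    "$TMUX_BIN" send-keys -t "$session" "ssh ubuntu@$ws2_ip" Enter
    # Same as "Ctrl + B" and double quotes: splits the active (right) pane.
    "$TMUX_BIN" split-window -v -t "$session"
    "$TMUX_BIN" send-keys -t "$session" "ssh ubuntu@$ws3_ip" Enter
    # Same as "setw synchronize-panes on".
    "$TMUX_BIN" set-window-option -t "$session" synchronize-panes on
}

# Usage on the jump box (not run here):
#   start_synced_session ws 10.0.0.2 10.0.0.3 && tmux attach-session -t ws
```

The IP addresses in the usage comment are placeholders. Because synchronization is already on when attaching, a first-time SSH host key prompt would receive the same answer in every pane.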


When we want to automate tasks, AWS provides a few means like SSM, OpsWorks, AMIs and so on. These are good for automating in the long run, but not when we want to try different things in an iterative approach or are not really sure what we want to do.

This is where tmux with the synchronization feature comes in handy. There is a lot more to tmux, but I hope this blog article helps you get started with tmux and builds some curiosity around it.