Big Data and Cloud Tips: 2020

Wednesday, October 14, 2020

Applications around the intersection of Big Data / Machine Learning and AWS

As many of the readers of this blog know I am a big fan of Big Data and the AWS Cloud, especially I am interested in the intersection of these two. But, Big Data processing requires huge number of machines, to process huge amounts of data and do some complex processing as in the case of Machine Learning.

Cloud has democratized the usage of Big Data, there is no need to buy any machines, we can spin a number of EC2 instances, do the Big Data processing and once done we can terminate the EC2 instances. AWS and other vendors are doing a lot of hardware and software innovations in this space, below are a few hardware innovations from AWS. They do require a lot of investment in the R&D and building them, which is usually possible at the scale Cloud operates.

AWS Nitro Systems : Some of the virtualization responsibilities have been shifted from the CPU to the dedicated hardware and software.

AWS Graviton Processor : The Graviton processor uses ARM based architecture, similar to the once used on mobile phones. Now we can spin EC2 with Graviton Processor.

AWS and Nvidia : They bring very high end GPU to the Cloud with the EC2 instances for Machine Learning modelling.

AWS Inferentia : Once the Machine Learning model has been created, the next step is inference which takes most of the CPU cycles. Inferentia is a custom chip from AWS for the same.

F1 Instances : Hardware acceleration on the EC2 using FPGA.

Coming back to the subject of this blog, AWS provides a few open data sets via S3 for free for us to do the processing in the Cloud and get some meaningful insights out of it. The data sets can be found here. For those who are familiar with either AWS or Big Data, the challenge is how to figure out how the intersection of these work together. For this AWS has published a bunch of blogs/articles here on the intersection of AWS and Big Data /Machine Learning for different domains. Below is a sample application around the intersection of Big Data and AWS around Genome data. Note that AWS has been highlighted, look out for more of them.

Summary

The intersection of Big Data / Machine Learning and AWS is very interesting. Cloud with the pricing democratizes the usage of Big Data / Machine Learning, but each one is a beast on its own to learn and there is a lot of innovation happening in this space and it's tough to keep in pace. Here are a few applications around these to get started. Good Luck !!!

Thursday, October 8, 2020

Setting up additional EC2 users with username/password and Keypair authentication

When an Ubuntu EC2 instances is created in the AWS Cloud, we should be able to connect to the EC2 using the username/password and the Keypairs. In the case of the Ubuntu AMI provided by AWS, only the Keypair authentication is enabled while the username/password authentication is disabled. Very often I get the query "How to create additional users for the Ubuntu EC2 with Keypair for authentication", so is the blog. At the end of the day, Linux is a Linux weather we run it in the Cloud, Laptop or in On-Premise, so the instructions apply everywhere.

Setting up an EC2 user with username/password authentication

Step 1: Create an Ubuntu EC2 instance and connect to it

Step 2: Add user "praveen" using the below command

#Enter the password and other details

sudo adduser praveen

Step 3: Open the "/etc/ssh/sshd_config" file and set "PasswordAuthentication" to yes

Step 4: Restart the ssh service

sudo service ssh restart

Step 5: Connect to the EC2 as the user "praveen" via Putty or some other software by specifying the password

Setting up an EC2 user with Keypair authentication

Step 1: Add user "sripati" and disable the the password authentication

#as we would be using the Keypair for authentication

sudo adduser sripati --disabled-password

Step 2: Switch as the user

sudo su - sripati

Step 3: Generate the keys. They would be in the .ssh folder

ssh-keygen

Step 4: Copy the public key to the authorized_keys file in the .ssh folder

cat .ssh/id_rsa.pub >> .ssh/authorized_keys

Step 5: Copy the private key in the ~/.ssh/id_rsa to a file sripati.pem on your local machine

cat ~/.ssh/id_rsa

Step 6: Using PuttyGen convert the pem file to ppk. "Load" the pem file and "Save private key" in the ppk format.

Step 7: Now connect via Putty via the username as "sripati", the public IP of the EC2 instance and private key in the ppk format. There is no need to specify the password.

Tuesday, October 6, 2020

Provisioning AWS infrastructure using Ansible

Cloud infrastructure provision can be automated using code. The main advantage is that the process can be repeated with consistent output and the code can be version controlled in github, bitbucket or something else.

AWS comes with CloudFormation for automation of the provisioning of the AWS infrastructure, the main disadvantage is that CloudFormation template (code) is very specific to AWS and takes a lot of effort to migrate to some other Cloud. In this blog we will look at Ansible using which infrastructure can be provisioned for multiple Clouds and also migrating code to provision code to some Cloud doesn't take as much effort as with CloudFormation.

We would installing Ansible on an Ubuntu EC2 instance for provisioning of the AWS infrastructure. Ansible can be setup on Windows also, but as we install more and more softwares on Windows (host OS) directly, it becomes slow over time. So, I prefer to launch an EC2, try a few things and tear it down once done with it. Anyway, lets look at setting up Ansible and create AWS infrastructure on it.

Step 1: Create an Ubuntu instances (t2.micro) and connect to it.

Step 2: Install Python and boto (AWS SDK for Python) on the EC2 instance using the below commands.

sudo apt-get update

sudo apt-get install python2.7 python-pip -y

pip install boto

Step 3: Install Ansible using the below command.

sudo apt install software-properties-common -y

sudo apt-add-repository --yes --update ppa:ansible/ansible

sudo apt install ansible -y

Step 4: Go to the IAM Management Console here (1) and create the Access Keys. Note them down.

Step 5: Export the Access Keys using the below commands. Make sure to replace 'ABC' and 'DEF' with the Access Keys which have been generated in the previous step.

export AWS_ACCESS_KEY_ID='ABC'

export AWS_SECRET_ACCESS_KEY='DEF'

Step 6: Create a file called "launch-ec2.yaml" with the below content. Make sure to replace the highlighted sections.

- name: Provision a set of instances

hosts: localhost

tasks:

- name: Provision a set of instances

ec2:

key_name: my-keypair

region: us-east-1

group_id:

- sg-0fa7df1dab4d7ebcb

- sg-040f6c6ef9932dbb5

instance_type: t2.micro

image: ami-0bcc094591f354be2

wait: yes

instance_tags:

Name: Demo

exact_count: 1

count_tag: Name

assign_public_ip: yes

vpc_subnet_id: subnet-59120577

Step 7: Execute the below command to launch an EC2 instance.

ansible-playbook launch-ec2.yaml

Step 8: Go to the EC2 Management Console and notice a new EC2 instance has been launched with the Name:Demo tag. Make sure to note down the "Instance ID" of the newly created EC2 instance.

Step 9: Create a file called "terminate-ec2.yaml" with the below content. Make sure to replace the highlighted section with the Instance ID of the EC2 got from the previous step.

- name: Terminate instances

hosts: localhost

tasks:

- name: Terminate instances

ec2:

state: "absent"

instance_ids: "i-08ef0942aabbc45d7"

region: us-east-1

wait: true

Step 10: Execute the below command to launch an EC2 instance.

ansible-playbook terminate-ec2.yaml

Step 11: Go back to the EC2 Management Console and notice that the EC2 which was created by Ansible will be in a terminated status within a few minutes.

Conclusion

By using YAML code, we were able to launch and terminate instance. Ansible allows to do lot of complicated things than this, this is something to start with. As mentioned earlier Ansible allows easy migration to some other Cloud vendor when compared to AWS CloudFormation. BTW, Ansible has been bought by Red Hat which has been bought by IBM. So, Ansible is part of IBM now.

For reference, here is the yaml code for launching and terminating the EC2 instances, the screen has been split horizontally using tmux.

Thursday, October 1, 2020

Automating EC2 or Linux tasks using "tmux"

A lot of times we do create multiple EC2 instances and install the same software on each one of them manually, this can be for trying out a Load Balancer feature or to test routing with High Availability across different Regions and Availability Zones. One way to avoid this manual process is to create an AMI, but they are immutable and a new AMI has to be created for even small changes. This is where tmux (Terminal Multiplexer) comes into play.

Here the assumptions is that we want three EC2 instances as shown above and they are fronted by an ELB, which will load balance the traffic across these EC2 instances. On each of these instances we would like to install Apache2 and create webpages. For this, we would be using one of the EC2 as the jump or bastion box and connect to the other two EC2 instances from here as shown below.

Step 1: Start three EC2 Ubuntu instances and name them as "WS1/Jump/BastionBox", "WS2" and "WS3".

Step 2: Download pagent.exe from here (1) and click on "Add Key" and point to the Private Key in the ppk format. Close the window.

Step 3: Connect to the EC2 instance names as "WS1/Jump/BastionBox" via Putty. In the "Host Name (or IP address)" specify the username and the IP as show below.

Go to "Connection --> SSH --> Auth" and make sure to select "Allow agent forwarding". This makes it easy to connect to the EC2 instances, as there is no need to specify the Private Key, it would be picked from pagent.exe. Click on "Open" to connect to the EC2 instance.

Step 4: Execute the tmux command to start it.

Step 5: Enter "Ctrl + B" and "%" to split the panes horizontally. Again enter "Ctrl + B" and "Double Quotes" to split the panes vertically. Now we should see three panes as shown below. Use the "Ctrl + B" and the arrow buttons to navigate the panes.

Step 6: On the right side upper and bottom panes execute the "ssh ubuntu@ip" command to login to the EC2 instances. Make sure to replace the IP address of WS2 and WS3 EC2 instances in the command.

Step 7: Now we are connected to three EC2 instances as shown below. Execute the "ifconfig" command on all the panes and note that the IP address should be different. This is to make sure we are connected to different EC2 instances.

Step 8: Now we will turn on the synchronization across the panes, this way any command executed on the panes will be automatically executed on the other panes also automatically. For synchronization to happen enter "Ctrl-B " and ":" and "setw synchronize-panes on" and "Enter Button". Use the setw command with "off" options to turn off the synchronization across the panes.

Step 9: Navigate to one of the pane and notice that any command executed in one of the pane would get executed in the other panes. Ain't it neat !!!

Conclusion

When we want to automate tasks AWS provides a few means like SSM, OpsWorks, AMI and so on. But, there are good for automating on the long run, but not good when we want to try different things in an iterative approach or we are really not sure what we want to do.

This is where tmux with the synchronization feature comes handy. There is lot more to tmux, but hope this blog articles helps you to get started with tmux and builds the curiosity around it.

Tuesday, September 29, 2020

Using the same Keypair across AWS Regions

In one of the previous blog (1), we looked what happens behind the scenes when we use a Keypair for authentications against Linux. This blog post is more about productivity. I do create and connect to EC2 instances quite often and so I have created Sessions in Putty for most of my regularly connected Linux instances. One of the Session is for AWS which automatically populates the username and the keypair as shown below. When I would like to connect to an EC2 instance all I need to specify the Public IP address of the EC2 instance.

It all looks fine and dandy, the only problem is when I create EC2 instances in different AWS regions to test High Availability or some other features and try to connect to them. With the above approach since the Keypairs have regional scope, when I connect to EC2 instances in different regions, I need to change the keypairs in Putty. It would be good to use the same Keypair across regions, this way I don't need to change when connecting to the EC2 in different regions when using Putty saved sessions feature. Let's look at how to.

Step 1: Download putty.exe and puttygen.exe from here (1). There is no need to install it, just downloading should be good enough.

Step 2: Go to the EC2 Management Console and create a Keypair. Generate the Keypair by selecting the pem or ppk format.

Step 3: When prompted store the private key.

The Keypair should be created as shown below.

Step 4: Start PuttyGen and click on Load.

Step 5: Point to the private key which has been downloaded earlier. If the file is not visible then remove the filter and select "All files (*.*)". Click on Open and click on OK.

Step 6: Click on "Save public key" and specify the same file name but with a pub extension as shown below.

Step 7: Go to the EC2 Management Console for some other region and navigate to the Keypair tab. Click on Actions and then "Import key pair".

Step 8: Click on "Choose file" and point to the pub file which was created earlier. Finally click on Import to create the Keypair.

Conclusion

Now we have created a Keypair in two regions. And both the regions have keypairs which have the same public/private key. So, we would be able to use the same Putty session when connecting to the EC2 instances in different regions. It's not a life saving hack, but it something interesting to know and saves a few seconds/clicks here and there.

Note that this approach is not recommended for production and sensitive setup as we are using the same Keypair across regions, but can definitely used when we are trying to learn AWS.

Tuesday, September 22, 2020

Connecting Lens IDE to K8S Cluster using port forwarding

In the previous blogs (1, 2), I mentioned about setting up K8S Cluster on laptop for the sake of experimenting. We should be able to connect to the Control Plane/Master instance and execute the kubectl commands to interact with the K8S cluster. For those who are new to K8S or not from technology background it might be a bit intimidating using the different options with kubectl, this is where Lens (K8S IDE) (1, 2) comes into play.

Lens is dubbed as K8S IDE and is a FOSS and can be integrated with multiple K8S Clusters at a time. Depending on the permissions, both Read and Write Operations are allowed on the K8S Cluster. As shown below, I had configured Virtual Machines for the K8S Cluster on the Laptop using VirtualBox.

'NAT Networking' was used for the VirtualBox networking as this allows to work in the offline mode, network communication across Virtual Machines and also access to the internet. The only caveat is that there is no direct connectivity form the Host Machines to the Guest Virtual Machines, port forwarding has to be used as mentioned in the documentation here (1).

Below is how the port forwarding has been configured in the VirtualBox global settings. The Host IP had been left out and will default to localhost. The port 27 from the localhost is pointing to the port 6443 on which K8S API Server is listening to. This is required for the Lens to connect to the K8S Cluster, rest of the rules are for connecting to the Virtual Machin Instances via SSH.

In the Lens, the ".kube/config" file from the K8S Control Plane/Master must be imported during setting up the Cluster. The ".kube/config" file didn't work as-is because port forwarding has been used and the X509 certificates are not valid for the localhost/127.0.0.1 IP address. Had to do two things.

(a) Generate the certificates on the Control Plane/Master as root using the below commands. Note that 10.0.2.101 is the IP address of the K8S Control Plane/Master on which the API Server is running. Thanks to the StackOverflow solution here (1).

rm /etc/kubernetes/pki/apiserver.*

kubeadm init phase certs all --apiserver-advertise-address=0.0.0.0 --apiserver-cert-extra-sans=10.0.2.101,127.0.0.1

docker rm -f `docker ps -q -f 'name=k8s_kube-apiserver*'`

systemctl restart kubelet

(b) And then modify the config file to point to 127.0.0.1:27, before importing and creating a K8S Cluster in Lens. Note that 27 is the port number configured in the VirtualBox port forwarding rules for the API Server.

Completing the above two steps allowed a connection from the Lens to the K8S API Server which is the single point of interface to the K8S Cluster. It took some time to figure it out, but it was interesting and fun. Below are some of the screens from the Lens around various dimensions.

(Details of the Control Plane/Master)

(Details of the Slave)

(Details of the nodes in the Cluster)

(Details of the Control Plane/Master)

(Overview of the workloads on the Cluster)

(Pods on the Cluster)

(DeamonSets on the Cluster)

(Services on the Cluster)

(Endpoints on the Cluster)

(Namespaces in the Cluster)

Likewise, it's possible to connect to multiple K8S Cluster from Lens and operate on them. Lens is context aware and automatically downloads the correct version of the kubectl from the Google K8S repository.

Conclusion

Lens is a nice K8S IDE and it's a nice way to get started with K8S and also very useful for those who are not that technology savvy to browse around different components in K8S. But, those who are familiar with K8S or have spent good amount of time it's a hinderance and would prefer executing the kubectl command. It's very much like using "vi" vs "notepad" for editing files. With the recent acquisition of Lens by Mirantis (1), we need to wait how Lens adds to the productivity.

Also, don't get too used to Lens, CKA and CKAD certifications don't allow the usage of Lens. Everything has be performed from the command line and one needs to be very familiar with vi/tmux and bunch of command line tools.

Saturday, September 19, 2020

Optimal VirtualBox network setting for K8S on Laptop

In one of the previous blog we looked at setting up K8S on a laptop. The advantages of this setup is the freedom to try out of different things and it is very quick to get started. On my laptop it takes about 5 minutes for the Virtual Machines to start including the K8S in them. The downside is it's mainly for learning things and doesn't take much load.

Recently I bought a new Lenovo ThinkPad and so had to go with the entire exercise of setting up K8S on it. BTW, pretty happy with the Laptop. The only gripe is that it comes with 8GB of RAM, need to upgrade it to 16GB, the maximum RAM it supports. The Laptop is very light and I can snug into any corner of the house to work with concentration easily.

Above is the setup on my previous Laptop, with one Control Plane (master) and two slaves. There had been a few problems with the VirtualBox networking. Different types of networking are supported by VirtualBox (1) and Bridged Networking was used. With Bridged Networking everything was working fine with the below problems.

- Had to be always connected to the network. Won't be able to work in the offline mode.

- Also, switching between the different networks will change the IP of the master and K8S would stop working.

As mentioned above there is more than one way of configuring the network in VirtualBox. The same can be seen in the Virtual Machine settings under Network tab.

Here(1) is a good article on the different types of networking in VirtualBox and details about them. On the Y-Axis we have the different types of networking and on the X-Axis the features they support. Let's narrow down to the type of networking we would like to use with VirtualBox by identifying the required features for having a K8S Cluster on the Laptop.

-- "VM <--> VM" -- Required for communicating across VM instances.

-- "VM <- Host" -- Required as we need to connect from the Host OS to the Guest for debugging etc.

-- "VM --> LAN" -- Required for the internet connection to download the different softwares

-- "VM --> Host" -- Is optional for connecting from the Virtual Machine to Host

-- "VM <-- LAN" -- Is optional for accessing the K8S Cluster from outside the Laptop

From the feature matrix and the required features, the only options left around the VirtualBox networking are NAT Network and Bridged Networking. The problem with the Bridged networking is that as mentioned above, it always requires connection to the network and switching to a different network changes the IP of the K8S master and breaks down the entire setup. The certificates during the K8S setup are tied to a specific IP and need to generated again each time the IP address of the master changes (1). This is not impossible, but is tedious every time we change the network and the IP address of the master changes. So, the only optimal option left is to use the NAT Network.

With the combination of the NAT Network in VirtualBox and using static IP address in guest Virtual Machines, we don't need to worry about changing from one network to another as the VirtualBox NAT Network has a DHCP component and an IP address from it can be configured as Static IP for the Guest Virtual Machines. Also, a Virtual Switch would be used for the communication across the different guest Virtual Machines and there is no need to be connected to the network. This ensures that we can work in the offline mode with K8S on the laptop even we are on the move. Below are the different components while using the VirtualBox NAT Network and how the network communication happens. Highlighted in the red is how the network communication happens.

The only catch with the NAT Network is that we won't be able to connect to the guest Virtual Machines directly without doing any port forwarding as mentioned in the VirtualBox documentation here (1). The documentation mentions NAT, but the same applies to the NAT Network also. This is a not a big issue, but is a matter of configuring the VirtualBox with "Port Forwarding Rules" before connecting to the guest Virtual Machines.

In a future blog, I will provide the binaries and the steps to easily setup K8S on a laptop. But, for now I took a screenshot of the Memory usage before and after starting the Virtual Machines on the laptop.

(Before)

(Starting the Virtual Machines with K8S)

(After)

(Laptop CPU and RAM)

Within 4 to 5 minutes, I was able to login to the K8S master and able to get the list of nodes and the pods using the kubectl command.

Conclusion

To conclude setting up K8S is not a hard task, but requires a bit of patience for the installation of the OS, softwares, configurations and finally cloning the Virtual Machines, so as to avoid repetition of tasks and saving time. Also "VirtualBox NAT Network" is the best option in the network type as this enables to work in the offline mode and doesn't break the K8S setup while switching between networks.

As mentioned I would be uploading the Virtual Machines Images and would be detailing the procedure for setting up K8S on a Laptop. But, I need to zip and uploads huge files, so it might take some time.

Pages

Wednesday, October 14, 2020

Thursday, October 8, 2020

Setting up an EC2 user with username/password authentication

Setting up an EC2 user with Keypair authentication

Tuesday, October 6, 2020

Thursday, October 1, 2020

Tuesday, September 29, 2020

Tuesday, September 22, 2020

Saturday, September 19, 2020