Monday, December 30, 2013

Getting started with Big Data - Part 2 - Installing Ubuntu on top of VirtualBox

In an earlier blog entry we looked into how to install VirtualBox on top of Windows, now it's time to install Ubuntu 12.04 64-bit Desktop which is latest LTS release as of this writing.

The reason for using 64-bit is that 64-bit OS addresses more RAM than a 32-bit OS and that some of the Big Data frameworks like Impala can be only deployed on a 64-bit OS. An Ubuntu Server edition could have been used instead of a Ubuntu Desktop edition, but the Desktop edition provides a nice UI (defaults to Unity) which makes it easy for those who want to get started with Linux.

- The first step is to download Ubuntu 12.04 and not the point release 12.04.3. There are some problems installing the Ubuntu 12.04.3 point release as mentioned here. The `ubuntu-12.04-desktop-amd64.iso` image can be downloaded from here or one of the alternate sites.

The md5 hash of the image is `128f0c16f4734c420b0185a492d92e52` which can be got once the iso file has been downloaded. Here is how to get the md5 hash for different OS. The md5 hash makes sure that the iso file which has been downloaded is not corrupted or tampered with.

- Start the VirtualBox and click on the `New` on top left.
Screen 1

- Give the VM a name and select `Linux` as the type and `Ubuntu (64-bit)` as the version.
Screen 2

Watermarking images using Phatch

I am a big fan of automating things and Ubuntu (or Linux) gives me more and more opportunities for the same. It makes me content automating mundane tasks and am willing to spend time to get it done.

Recently, I was looking on how to watermark the images on this blog and was exploring not only how to watermark, but also how to automate the process given a bunch of images. Started exploring GIMP Script-Fu which is similar to Excel Macros to automate tasks. But, given a simple task like watermarking, it was more of a big bite to learn a new language Script-Fu.

After a bit of googling, finally found Phatch. Initially, was not sure what Phatch means, but found it is a sort of acronym for PHoto & bATCH. Actions can be created using Phatch UI and can be chained together as shown below. The work-flow definition will be automatically generated as shown in this file.
The workflow gets all the images in a particular folder, applies the specified watermark to each of the image and finally stores the images in a `wm` folder. Note the below image is with a watermark at the bottom right.
Phatch can also be run from the command line as below specifying the phatch file which contains the work-flow definition and the location of the images which needs to be watermarked.
phatch wm.phatch /home/praveensripati/images/
There are a lot of obscure softwares like these for Linux which are far more better than the windows counterpart. It's all a matter finding them. Would be blogging about more productivity softwares/tools like these in the future blogs.

One of the reason many say for not moving to the Linux OS is the availability of the softwares, but this is far from true. Ubuntu repository has a lot of softwares and only some of them are installed by default during the installation process. Synaptic Package Manager which is a front-end to apt-get can be used to manage the softwares in Ubuntu. So, given a task to accomplish, it's a matter of picking the appropriate software.

More and more games are also getting into the Linux platform, so it's not a bad bet after all :)

Saturday, December 28, 2013

Getting started with Big Data - Part 1 - Installing VirtualBox on a Windows Machine

If not all, most (1, 2 etc) of the Big Data frameworks get built for Linux platforms and then later some of them are migrated to the Windows platform as a second thought. It's not something new, but Microsoft has partnered with HortonWorks to make sure that Hadoop and other Big Data frameworks work smoothly on the Windows platform. Microsoft had Dryad which is a distributed computing platform which had been abandoned for Hadoop. Not only Hadoop is being ported to Windows as HDP, but one can also expect tight integration with other Windows services/softwares.
linux_not_windows by nslookup on flickr
There are a number of factors which decide to go for Big Data/Windows or Big Data/Linux platform. Even when going for Big Data/Windows, it makes sense to have a Big Data/Linux for the development environment from a cost perspective. It's more or less the same as developing JEE applications JBoss (ok ok - WildFly) and then migrating them to other proprietary application platforms like IBM WAS. Although, it comes with a cost saving to do the development on Linux machines, there might be an additional effort which might have to be incurred to take into the fact that there are some (or a lot of) environment differences between Big Data/Windows or Big Data/Linux platforms. Also, any extensions provided in one platform cannot be developed/tested on the other platform.

As mentioned above, most of the Big Data frameworks get built on Linux and then ported to Windows platform. And it's really important to get initially familiar and then finally comfortable with Linux to deep dive into the Big Data space. Most of us are have worked on and comfortable with Windows, but it's  rare or less common to see someone who uses Linux on a day to day basis unless someone is working in IT or has been forced to use Linux by mass migrating from Windows to Linux.

One of the main reason for the hindrance to adopt Linux is the fear that installation of Linux might mess with the existing Windows installation. But, Linux has come long way and it has been no more easier installing and using Linux. There are more than 100 flavors of Linux to pick and choose from.

In this blog we will look into how to install VirtualBox on a Windows machine and with upcoming blogs on how to install Ubuntu and finally Apache Bigtop (which makes it easy to get started with a lot of Big Data frameworks easy) on top of Ubuntu. VirtualBox and other virtualization softwares like VMWare allow to run one OS on top of another OS or run multiple OS directly on the virtualization software.

Friday, December 27, 2013

Different kernel versions between Ubuntu 12.04 and 12.04.3 (point release)

`Ubuntu Tux floats` to work by danoz2k9 on flickr
Ubuntu not only provides a very usable flavor of Linux, but a very easy to remember naming convention (try to remember for Windows). This blog is being authored from Ubuntu 12.04 (Precise Pangolin), where 12 corresponds to the year 2012 and 04 corresponds to the month April when that particular version of Ubuntu has been released.

Ubuntu releases a new version every 6 months. So, after the 12.04 release there had been 12.10, 13.04 and 13.10 releases. Every fourth releases (10.04, 12.04, 14.04 etc) is a LTS (Long Term Support) and is supported for 5 years, while the non LTS releases are supported for only 9 months. This is the reason for sticking to the Ubuntu 12.04 release, because it is supported for 5 years and there is no need to upgrade the OS again and again till 2017.

The main disadvantage of sticking to an old release is that within a particular release Ubuntu usually provides security/stability updates, but not any updates with new features. For example, Ubuntu 12.04 comes with R (2.14.1) while the latest is R (3.0.2). Software can be upgraded to the latest version in Ubuntu by adding repositories and installing the softwares.

Similar to Service Packs in Windows, Ubuntu also provides point releases.
According to Mark Shuttleworth :

These point releases will include support for new hardware as well as rolling up all the updates published in that series to date. So a fresh install of a point release will work on newer hardware and will also not require a big download of additional updates.

As some of you might noticed from the blog, I use Ubuntu a lot for trying out different things around Big Data and create Ubuntu Virtual Machines quite often as shown below. In-fact, I keep a copy of the pristine Ubuntu 12.04 image which will not only help to avoid the installation steps, but will also help in brining a Guest OS ASAP.
Recently on creating an Guest OS using Ubuntu 12.04.3 (note the point release), it was observed that there was a kernel version mismatch between the Host (Ubuntu 12.04 - 3.2.0-53-generic) and Guest (Ubuntu 12.04.3 - 3.8.0-34-generic) in-spite of doing an upgrade on both of them multiple times. The kernel version can be found out using the `uname -r` command.

Installing the VirtualBox the Guest Additions on the Guest not only provides a better performance in the Guest, but also a better integration (shared folders, full screen, shared clipboard etc) between the Host and the Guest OS. But, because the Guest OS was using an updated version of the kernel (3.8.0-34-generic), there were some conflicts because of which the Guest Additions were not getting installed.

After installing Ubuntu 12.04 (not the point release) as the Guest OS and updating the packages, the VirtualBox Guest Additions were installed properly. It looks like for better Hardware support Ubuntu point release come with a different or a more updated kernel as part of the LTS Enablement Stacks. More about this here.

Thursday, December 26, 2013

Store and ETL Big Data in the Cloud with Apache Hive

Big Data and cloud storage paired with the processing capabilities of Apache Hadoop and Hive as a service can be an excellent complement to expensive data warehouses. The ever-increasing storage demands of big data applications put pressure on in-house storage solutions. This data is generally not stored in database systems because of its size. In fact, it is commonly the precursor to a data mining or aggregation process, which writes the distilled information into a database. However, for this process the constantly growing data collected from logs, transactions, sensors, or others has to be stored somewhere, inexpensively, safely, and accessibly.

Cloud Storage

Most companies can achieve two of the three attributes, e.g. safe and inexpensive (tapes) or safe and accessible (multi-location, data servers), or inexpensive and accessible (non-redundant server or network attached drive). The combination of the three requires economies of scale beyond most company's ability and feasibility for their data. Cloud providers like Amazon Web Services have taken on the challenge and offer excellent solutions like Amazon S3. It provides a virtually limitless data sink with a tremendous safety (99.999999999% durability), instant access, and reasonable pricing.

Utilizing Cloud Data

Scenarios for big data in the cloud consequently should consider S3 as a data sink to collect data in a single point. Once collected the data has to be processed. One option is to write custom software to load the data, parse it, and aggregate and export the information contained in it to another storage format or data store. This common ETL (Extract, Transform, Load) processing is encountered in data warehouses situations. So reinventing the wheel by developing a custom solution is not desirable. At the same time building or scaling a data warehouse for Big Data is an expensive proposition.

There is an alternative, which can ETL Giga-, Tera- and Petabytes of data with low information density cheaply and with mature tools to simplify the design and management of the ETL jobs. This is the domain of Apache Hadoop and Hive.

Accessing Cloud Data with Hive

Apache Hive offers a SQL-like interface to data stored on Hadoop's HDFS or Amazon S3. It does this by mapping data formats to tables and associating a directory with a table. We can then execute SQL-like queries against this data. Hive turns the queries into map reduce jobs for Hadoop to execute. Hadoop will read all the files in the directory, parsed based on the table definition, and the extracted data to process it according to the query.

A very simple example is a comma-separated file stored on S3:
1,10.34,Flux Capacitor,\N
2,5.00,Head phone,ACME Corp.
The '\N' is a Hive specific encoding and is read by Hive as NULL. An empty string would be read exactly as an empty string and not NULL.

A table definition in Hive to utilise this data would be:
    id INT,
    price DECIMAL,
    name STRING,
    manufacturer STRING)
LOCATION 's3n://your_bucket/some_where/';

Tuesday, December 17, 2013

Imapala vs Hive performance on Amazon AWS EMR

Apache Hive and Impala provide a SQL layer on top of the data in HDFS. Hive converts the SQL query into a MR plan and submits it to the Hadoop cluster for execution, while the Impala daemons directly process the data in HDFS. In an earlier blog we looked into when to use Hive and when Impala. Also, as mentioned in an earlier blog, both of them can use a common metastore, so any tables created in Hive would be visible in Impala and vice versa.
Hive on AWS EMR had been there for some time, but recently Impala has been added to the list. In this blog, we will compare the performance of Impala vs Hive on Amazon EMR using the same hardware (m1.large for master and m1.large for the slave) on the same data set. Note that a lot of improvements are in progress for both Hive and Impala and also EMR is using the latest versions of Hive and Impala as of now.

So, here are the steps at a high level:

- Launch a EMR cluster with Impala and Hive on AWS.
- ssh to the master and create some data.
- Create some tables using Impala shell and map it to the above data.
- Run the same queries on Hive and Impala to measure the performance.

And here are the steps in detail :

- Create the KeyPairs from the EC2 Management Console and the Access Keys (Access Key ID and Secret Access Key) in the IAM Management Console as mentioned here. While creating the KeyPairs the user will be prompted to store a pem file, which will be used to login to the master.

Monday, December 16, 2013

What's behind StackOverflow?

Some of us had been actively looking for help in StackOverflow and some actively helping others. Anyway, for those who had been working on any Technology, StackOverflow and the related sites have been a must-have. I know a couple of techie guys who have shifted to using StackOverflow from various mailing forums and other venues.

StackOverflow and a bunch of sites fall under the StackExchange umbrella. Here are the interesting details about the infrastructure and metrics behind the StackExchange. Thanks to the incredible team behind it to make it happen !!!

Friday, December 13, 2013

Unit Testing MapReduce with MRUnit framework
Distributed computing model programs like MapReduce are difficult to debug, so it's better to find and fix the bugs in the early stages of development. In an earlier post we looked into debugging a MapReduce in Eclipse and here we looked at unit testing Hadoop R Streaming programs. In this blog entry, we will look into how to unit test MapReduce programs using Apache MRUnit. In either case there is no need start HDFS and MR related daemons.

MRUnit is a Apache Top Level project and had been there for some time, but is not in the lime light and doesn't get much attention as the case of other projects, but is equally important to get the proper product out in time.

So here are steps:

- Download the below mentioned jar files.

mrunit-1.0.0-hadoop1.jar from
mockito-all-1.9.5.jar from
junit-4.11.jar from

- Copy the hadoop-core-1.2.1.jar, commons-cli-1.2.jar and commons-logging-1.1.1.jar files from the Hadoop installation folder into Eclipse and include them in the project dependencies.

- Create a project in Eclipse and add the above mentioned libraries as dependencies.

Tuesday, December 10, 2013

Running a custom MR job on Amazon EMR

In an earlier blog entry we looked at how to start an EMR cluster and submit the predefined WordCount example MR job. This blog details the steps required to submit a user defined MR job to Amazon. We will also look into how to open an ssh tunnel to the cluster master to get the access to the MR and HDFS consoles.

In Amazon EMR, there are three types of nodes. The master node, core node (for task execution and storage) and task node (for execution). We would be creating a master node, push the word count code to it, compile it, create a jar file, push the jar and data file to S3, then finally execute the MR program in the jar file. Note that the data can be in HDFS also.

- As described in the previous blog entry, the first steps is to create the `key pairs` from the EC2 management console and store the pem key file for later logging into the master node. Next create the Access Keys (Access Key ID and the Secret Access Key) in the IAM Management Console.

- Go to the S3 management console and create a bucket `thecloudavenue-mroutput`. Note that the bucket name should be unique, so create some thing else. But, this entry will refer to bucket as `thecloudavenue-mroutput`.

- Go to the EMR management console create a cluster. We would be creating a master node and not run any steps as of now.

- Give a name to the cluster and specify the logs folder location in S3. For some reason the logs are getting shown in S3 with a delay, so it had been a bit difficult to debug the MR jobs.

Friday, December 6, 2013

How to use the content of this blog?

As some of you have noticed, I have been blogging for the last 2 years about Big Data, Hadoop and related technologies. The blog was created with an intension to help those who wanted to get started with Big Data. There are a lot of interesting things happening in the Big Data space and I plan to keep blogging about them.

Some of the articles are precise and small, but it takes a lot of effort to get a grip on a subject and to express them in this blog. It's fun blogging and I would definitely encourage others to express themselves through blogging about the topic of their interest, not necessarily technology related.

There are not too many restrictions on how to use the contents of the blog. But, if you plan to use the content as-is or with minor modification else where like in another blog, I would really appreciate if a reference to the original content is made at the top as `This article originally appeared in ......".

Wednesday, December 4, 2013

New Virtual Machine to get started with Big Data

We offer a Virtual Machine (VM) for those interested in the Big Data training to get started easily. More details about the VM here. The VM avoids the burden of installing/configuring/integrating different Big Data frameworks on the developer. The VM can be easily installed on a Windows/Mac/Linux machine once it has been downloaded. Along with the VM, documentation with the instructions on how to use those different frameworks will also be provided.

The original VM had been created almost an year back and contained some outdated frameworks. So, we created a new VM for those who wanted to dive into the exciting world of Big Data and related technologies.

The old VM already had Hadoop, Hive, Pig, HBase, Cassandra, Flume, ZooKeeper which have been upgraded with the latest releases from Apache. The new VM has been built using Ubuntu 12.04 64-bit with the below mentioned  frameworks newly included.

- Oozie (for creating work flows and schedulers)
- Storm (for performing real time processing)
- Neo4J (for storing and retrieving graphs)
- Hadoop 2x (for executing MR on YARN)
- Phoenix (for SQL on top of HBase)
- rmr2 and rhdfs (for quickly developing MR in R)

Because of using FOSS, there is no limitation on how long the VM can be used. For more information about the Big Data Virtual Machine and the training, please send an email to

Tuesday, December 3, 2013

Getting started with Amazon EMR executing WordCount MR program

In an earlier blog entry, we looked into how easy it was to setup a CDH cluster on Amazon AWS. This blog entry will look into how to use Amazon EMR which is even more easier to get started with Hadoop. Amazon EMR is an PAAS (Platform As A Service) in which Hadoop is already installed/configured/tuned on which applications (MR, Pig, Hive etc) can be built and run.

We will submit a simple WordCount MapReduce program that comes with EMR. AWS has one of the best documentation, but there are some steps missing, this entry will try to fill them and add more details to it. The whole exercise takes less than 0.5 USD, so it should not be much much of a strain to the pocket.

Amazon makes it easy to get started with Hadoop and related technologies. All needed is a Credit Card to fork the money. The good thing is that Amazon has been slashing prices making it more and more affordable to everyone. The barrier to implement a new idea is getting low day by day.

So, here are the steps

- Create an account with AWS here and provide the Credit Card and other details.

- The next step is to goto the EC2 management console and create a Key Pair.
Download the pem file, it will be used along with ssh to login to the Hadoop nodes which we would be creating soon. To create an SSH tunnel to the Hadoop Master node, the same pem file would be required.