Friday, March 30, 2012

Hadoop Vs Hama

As I mentioned earlier, getting started with Hama is very easy for those who are familiar with Hadoop. Many of the ideas and much of the code have been borrowed from Hadoop.


Hama also follows the master/slave pattern, and the mapping between Hadoop and Hama components is straightforward: a JobTracker maps to a BSPMaster, a TaskTracker maps to a GroomServer, and a Map/Reduce task maps to a BSPTask.

The major difference between Hadoop and Hama is that while map/reduce tasks can't communicate with each other, BSPTasks can. So, when a job requires multiple iterations (as in the case of graph processing), with MapReduce (Hadoop) the data between iterations has to be persisted to disk and read back. This is not the case with BSP (Hama), as the tasks can communicate with each other directly. This leads to better efficiency, as the overhead of writing to and reading from disk is avoided.
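To make the BSP model concrete, here is a minimal sketch in Python (this is not the Hama API, just an illustration): each task keeps its state in memory and exchanges messages with its peers between supersteps, separated by a barrier, so nothing has to be persisted between iterations. The example propagates the maximum value through a small graph of tasks.

```python
# Illustrative sketch of the BSP model (not the Hama API): tasks compute,
# send messages to peers, then hit a barrier; the loop repeats until no
# task's state changes in a superstep.

def bsp_max(values, neighbors):
    """Propagate the maximum value to every task via BSP supersteps."""
    state = dict(values)                     # per-task state, kept in memory
    while True:
        outbox = {t: [] for t in state}
        for task, value in state.items():    # "compute" phase: send to peers
            for peer in neighbors[task]:
                outbox[peer].append(value)
        # barrier synchronization: all messages are delivered at once
        changed = False
        for task, msgs in outbox.items():
            best = max(msgs + [state[task]])
            if best != state[task]:
                state[task] = best
                changed = True
        if not changed:                      # converged: stop iterating
            return state

values = {"a": 3, "b": 7, "c": 1}
neighbors = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(bsp_max(values, neighbors))  # every task ends up with 7
```

With MapReduce, each of those supersteps would be a separate job whose output is written to HDFS and read back by the next job; here the messages simply stay in memory.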

Here are some more details about the Hama architecture.

Tuesday, March 27, 2012

HortonWorks Webinars

HortonWorks has scheduled a series of Webinars in April and May:

- Simplifying the Process of Uploading and Extracting Data from Hadoop

- HDFS Futures: NameNode Federation for Improved Efficiency and Scalability

- Improving Hive and HBase Integration

Saturday, March 24, 2012

BSP (Hama) vs MR (Hadoop)

BSP (Bulk Synchronous Parallel) is a distributed computing model similar to MR (MapReduce). For some classes of problems BSP performs better than MR, and for others it's the other way around. This paper compares MR and BSP.

While Hadoop implements MR, Hama implements BSP. The ecosystem/interest/tools/$$$/articles around Hadoop are much larger than around Hama. Hama recently released 0.4.0, which includes examples for page rank, graph exploration, shortest path and others. Hama has many similarities (API, command line, design etc.) with Hadoop, and it shouldn't take much time to get started with Hama for those who are familiar with Hadoop.

As of now Hama is in the incubator stage and is looking for contributors. Hama is still at an early stage and there is a lot of scope for improvement (performance, testing, documentation etc.). For someone who wants to start contributing to Apache, Hama is a good choice. Thomas and Edward have been actively blogging about Hama and are very responsive to any clarifications in the Hama community.

I am planning to spend some time on Hama (learning and contributing) and will keep posting about the progress on this blog.

Ubuntu - Syncing packages across multiple machines

I have multiple Ubuntu machines, and a couple of times I had to install Ubuntu from scratch, either because an Ubuntu upgrade was not smooth or because I messed it up. In either case, after installing Ubuntu, all the additional software had to be installed again. I used to maintain a command similar to the one below in Dropbox:
sudo apt-get install bum nautilus-dropbox vlc gufw ubuntu-restricted-extras nautilus-open-terminal pidgin freemind gimp subversion autoconf libsvn-java nethogs gedit-plugins skype okular wine1.3 artha alarm-clock git gnome-shell ssh qgit virtualbox virtualbox-guest-additions-iso kchmviewer id3v2 g++ lxde chromium-browser
`Ubuntu Software Center`, which is installed by default in Ubuntu, has a feature (File -> Sync Between Computers ...) for syncing installed packages across multiple Ubuntu machines. The list of installed packages is maintained online.
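Another option is to capture the installed package list with `dpkg --get-selections` on one machine and diff it against another. Here is a small sketch (the sample file contents below are made up for illustration) that parses two such captures and prints the install command for the missing packages:

```python
# Sketch: given `dpkg --get-selections` output captured on two machines,
# work out which packages still need to be installed on the second one.

def parse_selections(text):
    """Parse `dpkg --get-selections` output into a set of installed packages."""
    packages = set()
    for line in text.splitlines():
        parts = line.split()
        # lines look like "vlc\tinstall"; skip "deinstall" entries
        if len(parts) == 2 and parts[1] == "install":
            packages.add(parts[0])
    return packages

# Made-up sample captures standing in for the two machines' files
old_machine = "vlc\tinstall\ngimp\tinstall\nskype\tinstall\nwine1.3\tdeinstall\n"
new_machine = "vlc\tinstall\n"

missing = sorted(parse_selections(old_machine) - parse_selections(new_machine))
print("sudo apt-get install " + " ".join(missing))
```

In practice the captures would come from `dpkg --get-selections > packages.list` on each machine rather than from hard-coded strings.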

Wednesday, March 21, 2012

My new Nano

I bought the 1st generation Apple Nano from BestBuy some 7-8 years back, and it was gathering dust because of my iPod. Apple recalled the old Nano because it posed a safety risk, and a month back I got an 8GB silver Nano as a replacement. It took more than 3 months for Apple to replace it, but I am extremely happy with the new one.


It works fabulously, but the only gripe I have is that there is no iTunes for Linux. When I tried to sync the Nano with Banshee, the Nano crashed and I had to restore it on a Windows machine. I need to spend some time figuring out how to get sync to work on an Ubuntu machine.

BTW, my 4 year old son is very comfortable using an LG Optimus One (Android) and an iPad (iOS) to play music, view pictures and play games. While many of us grew up with bulky monochrome monitors, RAM in MBs and HDDs in single-digit GBs, many of today's kids are growing up with smart phones and tablets. When they grow up and get a chance to decide between a tablet and a PC, there is no doubt about what they will pick.

HDFS Facts and Fiction

Sometimes even the popular blogs/sites don't get the facts straight; I am not sure whether such articles are reviewed by others or not. As I mentioned earlier, changes in Hadoop are happening at a very rapid pace and it's very difficult to keep up with the latest.

Here is a snippet from a ZDNet article:

> The Hadoop Distributed File System (HDFS) is a pillar of Hadoop. But its single-point-of-failure topology, and its ability to write to a file only once, leaves the Enterprise wanting more. Some vendors are trying to answer the call.

HDFS supports appends, and that's at the core of HBase; without HDFS append functionality, HBase doesn't work. There had been some challenges in getting appends to work in HDFS, but they have been sorted out. Also, HBase supports random/real-time read/write access to the data on top of HDFS.

I agree that the NameNode is a SPOF in HDFS. But HDFS High Availability will include two NameNodes; for now, switchover from the active to the standby NameNode is a manual task done by an operator, and work is going on to make it automatic. Also, there are mitigations around the SPOF in HDFS, like having a Secondary NameNode and writing the NameNode metadata to multiple locations.
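For example, the NameNode can be told to write its metadata to more than one directory by giving `dfs.name.dir` (the Hadoop 1.x property name) a comma-separated list of paths in hdfs-site.xml; an NFS mount is a common choice for the second copy. The paths below are just placeholders:

```xml
<property>
  <name>dfs.name.dir</name>
  <!-- local disk plus an NFS mount; the NameNode writes to both -->
  <value>/data/dfs/nn,/nfsmount/dfs/nn</value>
</property>
```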

One thing to ponder: if HDFS were so unreliable, we wouldn't have seen so many clusters using HDFS. Also, other Apache frameworks are being built on top of HDFS.

Thursday, March 15, 2012

How easy is it to use Hadoop?

This article made me think about how easy it is to set up Hadoop. Setting up Hadoop on a single node or multiple nodes and running MR jobs is not a big challenge. But getting it to run efficiently and securely is a completely different game. There are too many Hadoop parameters to be configured, some of which are not documented. Now add different hardware configurations, different Hadoop versions and the lack of documentation to the mix.

All the Hadoop production clusters have a separate team who are very familiar with the Hadoop code and know it inside out. Hadoop is evolving at a very fast pace, and keeping up with the changes in the Hadoop code is like chasing a moving target. It's just not possible to download the Hadoop binaries and use them as-is in production efficiently.

I believe in the potential of Hadoop and don't want to deter those who want to get started with it. This blog is all about making it easy for those who are getting started with Hadoop. Things will change as more and more vendors get into the Hadoop space and contribute code to Apache. But for now Hadoop is a beast, and there is and will be huge demand for Hadoop professionals.

Friday, March 9, 2012

Home partition under Ubuntu

As mentioned in the previous blog, I was playing with LXDE as a replacement for the Unity desktop environment on Ubuntu. For some reason (after installing/uninstalling some components of LXDE) the notebook was not booting into either Unity or LXDE, and reinstalling LXDE was of no use. Luckily, I had created a separate partition for home, so it was a matter of just reinstalling Ubuntu 11.10 under root (/) from CD, updating it (sudo apt-get update;sudo apt-get upgrade) and installing the required software (sudo apt-get install ......). The data in the home partition was intact.

In the past I have had experiences where an Ubuntu upgrade was not smooth and I had to re-install Ubuntu, losing all the data because home wasn't on a separate partition. Ubuntu has a 6-month release cycle, and it is recommended to freshly install Ubuntu instead of upgrading, along with having a separate partition for home.

I haven't tried it out, but having a separate partition for home might also help to share user data among multiple OSes on a single machine.

Thursday, March 8, 2012

LXDE Desktop on Ubuntu 11.10



Ubuntu ships with Unity, which was OK until I installed LXDE (Lightweight X11 Desktop Environment) using the following command:

sudo apt-get install lxde

Although LXDE is a bit rough, it's way faster than Unity or GNOME. On the same notebook Unity took ~232 MB of RAM, while LXDE took ~140 MB. With a 4 GB RAM notebook, a ~92 MB saving of RAM won't make much of a difference. But its snappiness makes me love and stick to LXDE.

The LXDE site also claims that it requires less energy to perform tasks than other systems on the market, so I should be able to run on battery for longer, which is another plus. I also noticed that LXDE boots ~20-25 seconds faster than Unity.

When Ubuntu introduced Unity it took some time to get used to it, and the same is the case with LXDE.

Thanks to howtogeek for the tip and the instructions.