Friday, October 28, 2011

Hadoop and MapReduce Algorithms Academic Papers

Hadoop, in spite of starting in a web-based company (Yahoo), has spread to solve problems in many other disciplines. Amund has consolidated a list of 'Hadoop and MapReduce Algorithms Academic Papers'. It gives an idea of where Hadoop and MapReduce can be used, and some of the papers can also serve as project ideas. The Atbrox Blog maintains this list.

Nice blogs on Hadoop and Distributed Computing

Thursday, October 27, 2011

'Hadoop - The Definitive Guide' Book Review

For those who are serious about getting into Hadoop, besides going through the tons of articles and tutorials on the Internet, 'Hadoop - The Definitive Guide' (2nd Edition) by Tom White is a must-have book. Most tutorials stop at the 'Word Count' example, but this book goes to the next level, explaining the nuts and bolts of the Hadoop framework with a lot of examples and references. The most interesting and important thing is that the book also explains why certain design decisions were made in Hadoop.

Not only does the book cover HDFS and MapReduce, it also gives an overview of the layers that sit on top of Hadoop, like Pig, Hive, HBase, ZooKeeper and Sqoop.

The book could definitely be improved in the following areas:
  • MapReduce is covered in detail, but HDFS internals and fine-tuning are covered only at a high level.
  • To stay in sync with Hadoop development and features, it's still necessary to get the source from trunk or another branch, then build, package and try it out.
  • NextGen MapReduce, HDFS Federation and a slew of other features being released as part of Hadoop Release 0.23 are not covered.

The 3rd Edition of the same book is due on April 30th, 2012, and it looks like it has more case studies as well as new material on MRv2. The 3rd Edition is worth waiting for, but for the impatient who want to get started immediately, the 2nd Edition is a must-have.

Saturday, October 22, 2011

Hadoop on Windows

As some of you might have read, Hortonworks and Microsoft have partnered to get Hadoop running on Windows. To date, Hadoop has been run only on Linux in production, but on both Windows and Linux for development. In the future, we will also be seeing Hadoop on Windows in production.

- It's not the first time Hadoop and Microsoft have come together. Microsoft acquired the semantic search engine Powerset, which is now part of the Bing search engine. Powerset used Hadoop internally. I later read that Microsoft replaced Hadoop with some other software after the acquisition (disclaimer: not 100% sure about it).

- Then there are Dryad (a platform for distributed computing) and DryadLINQ (a high-level abstraction language for distributed computing) from Microsoft. DryadLINQ is tightly integrated with .NET and Windows, and would likely run much more efficiently on Windows than Hadoop on Windows. I am not sure whether Microsoft will give enough focus to Hadoop alongside Dryad.

- The Apache Hadoop documentation recommends Oracle JDK 6. Unpatched Apache Hadoop doesn't run on IBM JDK, the defunct Apache Harmony, or OpenJDK, and now Windows is being added to the mix.

- I am not a performance expert on cross-platform applications, but it might be a challenge to make the same version of Hadoop perform well on both Linux and Windows at the same time.

- The one good thing about all of this is that Microsoft would be contributing the code back to Apache, so there would be more eyes looking at the Hadoop code. Also, Microsoft is having its own employees work on Hadoop rather than outsourcing it.

- Also, as Steve mentioned, there is very little chance that Hadoop on Windows will be deployed for internal use. So, someone outside has to step up, deploy Hadoop on a big cluster and find the bugs.

Considering all these factors, let's wait and see whether there will be more Hadoop on Linux or on Windows.

Edit: I was a bit skeptical about Microsoft's commitment to Hadoop. But it looks like Microsoft is jumping into Hadoop all the way. This is good news for Hadoop.

Edit (13th December, 2011): Microsoft to allow a limited preview of Hadoop on Azure.

Edit (15th December, 2011): The following URL points to WIP documentation for Hadoop on Windows.

Edit (12th January, 2012): Avkash has been passionately blogging about Hadoop on Windows Azure.

Oh My God - Ubuntu !!!!

I have been using Ubuntu for more than 3 years and have been actively using it for the last year, since I am working more and more on Hadoop. Ubuntu was supposed to be the flavor of Linux that is easy to install and runs out of the box with minimal additional software or configuration. Some of the features that were there in 11.04 are missing in 11.10. Here are some of my gripes. They may not be significant for a Linux geek, but they may be significant for someone who wants to get started with Ubuntu for the first time.

- The screen saver has been removed in 11.10, and there is a blueprint (smile) for a new screen saver. When something is not ready for 11.10, why remove the existing working software? To get the screen saver running in 11.10, a couple of packages had to be installed/removed and an application had to be added to the start-up.

- User passwords can be changed from the UI, but to make a user a member of a group, the terminal has to be used. One reason to use the CLI.

- After installing 11.10, the machine was a bit slower than with 11.04. So, I used 'Startup Applications' to see the applications brought up during boot. It had only one entry in the list, which didn't make much sense. I later came to know that the startup applications are picked up from /etc/xdg/autostart and that these are not shown in 'Startup Applications'. Another reason to use the CLI.

- There is no easy way to add icons to the Unity Launcher or change their order. A couple of files have to be tweaked to get the 'quick list' in the launcher.

- With GNOME 2.x, launching a new application instance was just a matter of clicking the icon in the panel. With Unity, it takes two clicks to launch a new application instance. This reminds me of how it took something like 6-7 clicks in Windows XP to change something really simple.

- There are some nice features, like integration with OpenStack. But that's not worth a dime to someone who wants to do basic stuff like chatting, browsing etc.
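For the startup entries that 'Startup Applications' does not show, a small script can list what is actually sitting in /etc/xdg/autostart. What follows is only a sketch in Python. The function names parse_desktop_entry and list_autostart are my own, and the idea that the hidden entries are the ones marked NoDisplay=true is an assumption based on the .desktop file format, not something checked on every 11.10 install.

```python
import os


def parse_desktop_entry(text):
    """Return the key/value pairs found in the [Desktop Entry] group."""
    entry = {}
    in_group = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("["):
            # Only the main group is of interest here; ignore any others.
            in_group = (line == "[Desktop Entry]")
            continue
        if in_group and "=" in line:
            key, value = line.split("=", 1)
            entry[key.strip()] = value.strip()
    return entry


def list_autostart(directory="/etc/xdg/autostart"):
    """List the .desktop files in `directory` with their command and visibility."""
    results = []
    if not os.path.isdir(directory):
        return results
    for filename in sorted(os.listdir(directory)):
        if not filename.endswith(".desktop"):
            continue
        path = os.path.join(directory, filename)
        with open(path, encoding="utf-8", errors="replace") as handle:
            entry = parse_desktop_entry(handle.read())
        results.append({
            "file": filename,
            "name": entry.get("Name", filename),
            "exec": entry.get("Exec", ""),
            # Assumption: NoDisplay=true is what hides an entry from the UI.
            "hidden": entry.get("NoDisplay", "false").lower() == "true",
        })
    return results


if __name__ == "__main__":
    for item in list_autostart():
        flag = "hidden" if item["hidden"] else "shown"
        print(f"{item['file']:40} {flag:7} {item['exec']}")
```

Running it without arguments prints one line per autostart file, which makes it easier to spot what is being started at login without digging through the directory by hand.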

I am not against the CLI; in fact, I write shell scripts to automate tasks on a regular basis. But I find it disturbing that some basic features have been removed from Ubuntu 11.10 and that it takes more clicks to perform some actions in Unity than in GNOME 2.

Hope 12.04 is better !!!