Friday, September 30, 2011

Resources for NextGen MapReduce

Edit: For easier access I have moved this to the pages section just below the blog header and will no longer maintain this entry.

'Next Generation MR', 'NextGen MR', 'MRv2', and 'MR2' all refer to a major revamp of the MapReduce engine that will be part of the 0.23 release. MRv1, the old MapReduce engine, will not be supported in the 0.23 release. Although the underlying engine has been revamped in 0.23, the API for interfacing with the engine remains the same. So, existing MapReduce code written for the MRv1 engine should run without modifications on MRv2.

The architecture and the information for building and running MRv2 are spread across many sources, and this blog entry will try to consolidate all the available information on MRv2 in one place. I will keep updating this entry as I get more information about MRv2, instead of creating a new one. So, bookmark this and check it often :).

Current Status

http://www.hortonworks.com/update-on-apache-hadoop-0-23/ - 27th September, 2011

http://www.cloudera.com/blog/2011/11/apache-hadoop-0-23-0-has-been-released/ - 15th November, 2011

http://hortonworks.com/apache-hadoop-is-here/ - 16th November, 2011

Home Page

http://hadoop.apache.org/common/docs/r0.23.0/

Architecture

The Hadoop Map-Reduce Capacity Scheduler

The Next Generation of Apache Hadoop MapReduce

Next Generation of Apache Hadoop MapReduce – The Scheduler

Detailed document on MRv2

Presentation

Quick view of MRv2

JIRAs

https://issues.apache.org/jira/browse/MAPREDUCE-279

Videos

Next Generation Hadoop MapReduce by Arun C. Murthy

Code

http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/

Building from source and running a sample

http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/BUILDING.txt
http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/INSTALL

Thursday, September 29, 2011

QGit - GUI for Git

Once the remote Apache Hadoop repository has been cloned locally, it's time to use a GUI to walk through the development history of the trunk and the different branches/tags. Gitk (which comes with the installation of Git), EGit (the Eclipse plugin for Git), and QGit are some of the UIs for Git. QGit is more intuitive than the others. We will go through the steps for installing, configuring, and using QGit.

Installing QGit

QGit doesn't come with the default installation of Ubuntu and can be installed from the command line or using 'Synaptic Package Manager'. Root permissions are required to install QGit.

sudo apt-get install qgit

will install QGit. QGit can be started from the menu (Applications -> Programming -> qgit) or using the 'qgit' command from the command line.

Configuring and using QGit

Check out the branch whose history has to be viewed in QGit using the 'git checkout <branch>' command. The output of the 'git branch' command shows the selected branch with a *.
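As a quick illustration of the starred output, here is a sketch using a hypothetical throwaway repository (the repository contents and the branch name are examples, not Hadoop's):

```shell
# Create a throwaway repository just to illustrate branch switching;
# everything here is made up for the example.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo hello > file.txt
git add file.txt
git commit -qm "initial commit"
git branch branch-0.23        # create a branch
git checkout -q branch-0.23   # switch to it
git branch                    # the current branch is marked with a *
```

The last command lists all local branches, with '* branch-0.23' marking the branch that is currently checked out.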

To view the Git repository in QGit

- Start QGit from the menu or command line.
- Go to 'Edit -> Settings' and deselect the 'Show range select when opening a repository' and 'Reopen last repository on startup' options.

- Select 'File -> Open...' and select the hadoop-common folder which was created with the 'git clone git://git.apache.org/hadoop-common.git' command.

- Select 'View -> Toggle tree view' to show the 'Git tree' on the left side. The 'Git tree' pane on the left shows you the state of the tree at the commit that has been selected in the 'Rev list' pane.

- Navigate to a file in that tree and double click on it to see the content of the file at that version and an indication of who most recently changed each line before that version.


This covers the basics for viewing the development history of a specific branch in QGit.

Happy Gitting :)

Viewing Hadoop code in Git

Git is an open-source distributed version control system. Distributed means that the developer gets a local copy of the entire development history along with the code (it's like cloning an entire remote repository locally). Because the data is local, Git is much faster than other version control systems. With Subversion and other centralized version control systems, only the source code is copied locally, and the central server has to be contacted for the development history.

Some of the advantages of Git are

1) working offline when there is limited or no network connectivity
2) browsing through the source code without explicitly downloading the code for the trunk or a branch/tag
3) checking the history of the trunk/tags/branches offline
4) keeping a lot of local changes to the Hadoop code under version control

Hadoop is evolving at a very rapid pace, and it would be beneficial to configure Git with Hadoop. We will go through the instructions for (1) and (2) and leave (3) and (4) for a later blog entry. The instructions below are for Ubuntu 11.04 and might differ slightly for other OSes.

Installing Git

Git is not installed by default in Ubuntu 11.04 and can be installed from the command line or using 'Synaptic Package Manager'. Root permissions are required to install Git.

sudo apt-get install git

will install Git.

'which git' will give the location of Git and also confirm that Git is installed properly.

Getting the code from Apache

Apache provides read-only Git mirrors of its codebases, so code cannot be submitted to Apache through the Git mirrors.

git clone git://git.apache.org/hadoop-common.git

will copy the code along with the development history locally. A directory 'hadoop-common/.git' will be created, which contains all the information Git needs to function.

The 'git clone' command currently downloads about 150 MB of files and will take time depending on the network bandwidth. This is a one-time download, and all subsequent updates to the code will be much faster.
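To see what a clone produces without touching the network, here is a sketch that clones a small local repository standing in for the Apache mirror (all paths and names here are made up for illustration):

```shell
set -e
work=$(mktemp -d)
cd "$work"
# Stand-in for the remote Apache mirror: a tiny local repository
git init -q upstream
(cd upstream \
  && git config user.email demo@example.com \
  && git config user.name demo \
  && echo readme > README \
  && git add README \
  && git commit -qm "initial commit")
# Cloning copies the code plus the full history into the .git directory
git clone -q upstream hadoop-common
ls -d hadoop-common/.git   # -> hadoop-common/.git
```

The clone contains both the working files (README) and the complete repository data under hadoop-common/.git, which is what makes offline history browsing possible.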

All the commands from now on should be run from the hadoop-common directory or one of its sub-directories.

How to know the current branch?

git branch

will list all the local branches, with a star beside the current working branch. By default, 'trunk' is the current working branch.

Viewing the code for a particular branch

git checkout branch-0.23

will switch to the new branch context; that is, the code in the directory will be for the branch specified in the 'git checkout' command.
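The effect can be seen in a small sketch: the same file shows different content depending on which branch is checked out (throwaway repository, made-up file and branch names, for illustration only):

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo "trunk code" > version.txt
git add version.txt
git commit -qm "commit on the initial branch"
git checkout -qb branch-0.23            # create and switch to a branch
echo "0.23 code" > version.txt
git commit -qam "commit on branch-0.23"
cat version.txt                         # -> 0.23 code
git checkout -q -                       # back to the previous branch
cat version.txt                         # -> trunk code
```

Switching branches rewrites the files in the working directory to match the chosen branch; no separate download is needed because the full history is already local.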

Getting the latest code from Apache

git pull

will fetch the changes from the remote repository and merge them locally into the current branch (the one with a * in the output of the 'git branch' command).

for i in `git branch | sed 's/^.//'`; do git checkout $i ; git pull; done

will fetch and merge for all the local branches (the 'git branch' command lists all the local branches).
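Parsing the decorated output of 'git branch' is fragile, since the current branch carries a leading '* '. A variant that asks Git for plain branch names, sketched on a throwaway repository (the branch names are examples, and the loop body is an echo because there is no remote to pull from):

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo x > f.txt
git add f.txt
git commit -qm "initial commit"
git branch branch-0.22
git branch branch-0.23
# for-each-ref prints undecorated branch names, one per line
for b in $(git for-each-ref --format='%(refname:short)' refs/heads/); do
  echo "would checkout and pull: $b"
done
```

In the real repository the loop body would be 'git checkout "$b" && git pull'; the echo here just shows that the iteration yields clean branch names with no '* ' prefix to strip.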

In the coming blog entry, we will go through the development history of Hadoop using Git.

References

Git Home Page

Basic information about some of the Git commands

Git and Hadoop Wiki

Introduction to Git with Scott Chacon of GitHub - Scott (author of Pro Git) seems to be in a hurry in the video :)

Wednesday, September 28, 2011

Theme of the blog

Hadoop is an open-source Java framework for the distributed processing of large amounts of data. Hadoop is based on the MapReduce programming model published by Google. As you browse the web, there is a good chance that you are touching Hadoop or the MapReduce model in some way.

The beauty of open source is that the framework is open for anyone to use and, with enough commitment, modify to their own requirements. But the big challenge in adopting Hadoop, or in fact any other open-source framework, is the lack of documentation. Even when documentation exists, it might be sparse or stale. And sometimes no documentation is better than incorrect or outdated documentation.

As I journey through learning Hadoop, I will blog tips and tricks for the development, debugging and usage of Hadoop. Feel free to comment on the blog entries with any corrections or better ways of doing things.