Big Data and Cloud Tips: Viewing Hadoop code in Git

Thursday, September 29, 2011

Viewing Hadoop code in Git

Git is an open-source distributed version control system. Distributed means that the developer would be getting a local copy of the entire development history along with the code (it's like cloning an entire remote repository locally). Because of the local data, Git is much faster compared to other version control systems. With Subversion and other version control systems, only the source code will be copied locally and the central server has to be contacted for the development history.

Some of the advantages of Git are

1) working offline when there is limited or no network connection

2) browsing through the source code without explicitly downloading the code for a trunk or a branch/tag
3) checking the history of the trunk/tags/branches offline

4) when making a lot of changes to the Hadoop code and to keep them under version control locally

Hadoop is evolving at a very rapid pace and it would be beneficial to configure Git with Hadoop. We will go through the instructions for (1) and (2) and leave (3) and (4) for a later blog entry. The below instructions are for Ubuntu 11.04 and might slightly differ for other OS.

Installing Git

Git is not installed by default in Ubuntu 11.04 and can be installed from the command line or using 'Synaptic Package Manager'. Root permissions are required to install Git.

sudo apt-git install git

will install git.

'which git' will give the location of Git and also ensure that Git is installed properly.

Getting the code from Apache

Apache provides read-only Git mirrors of the Apache codebases. So, the code cannot be submitted through the Git mirrors to Apache.

git clone git://git.apache.org/hadoop-common.git

will copy the code along with the development history locally. A directory 'hadoop-common/.git' will be created which will contain all the information for Git to function.

'git clone' command will currently download about 150 MB of files and will take time based on the network bandwidth. This is a one-time download and all the updates to the code will be much faster.

All the commands from now on should be run from hadoop-common directory or one of it's sub-directory.

How to know the current branch?

git branch

will list all the local branches and current working branch will have a star beside it. By default, the 'trunk' is the current working branch.

Viewing the code for a particular branch

git checkout branch-0.23

will switch to the new branch context. Means, the code in the directory will be for the branch specified in the 'git checkout' command.

Getting the latest code from Apache

git pull

will fetch the data from the remote repository and merge it locally for the current branch you are in (the one with a * in the output of the 'git branch' command).

for i in `git branch | sed 's/^.//'`; do git checkout $i ; git pull; done

will fetch and merge for all the local branches ('git branch' command will give list of all the local branches).

In the coming blog entry, we will go through the development history of Hadoop using Git.

References

Git Home Page

Basic information about some of the Git commands

Git and Hadoop Wiki

Introduction to Git with Scott Chacon of GitHub - Scott (author of Pro Git) seems to be in a hurry in the video :)

Big Data and Cloud Tips

Pages

Thursday, September 29, 2011

Viewing Hadoop code in Git

No comments:

Post a Comment