Thursday, November 24, 2011

Browsing the MRv2 code in Eclipse

MRv2 is a revamp of the MapReduce engine that aims to make Hadoop more reliable, available and scalable, and to improve cluster utilization. MRv2 has been under active development for some time, and alpha releases are coming out now. The Cloudera article is very useful for getting the code from SVN, building, deploying and finally running a job to make sure Hadoop has been set up properly. Here are some additional steps:

- Protocol Buffers is used for RPC between the different daemons. The recommendation is to use Protocol Buffers version 2.4.1+. Some Linux releases don't ship this version, so the Protocol Buffers code has to be downloaded, built and installed using the `./configure`, `make` and `make install` commands. `make install` requires administrative privileges. `protoc --version` prints the version number.

- In Ubuntu 11.10, g++ was not installed by default. `sudo apt-get install g++` installed the required binaries.

- As mentioned in the article, git can be used to get the source code. However, git pulls the entire Hadoop repository. Code for a specific version can instead be checked out using the command `svn co source/`.

- Once the code has been successfully compiled using the `mvn clean install package -Pdist -Dtar -DskipTests` command, `mvn eclipse:eclipse` will generate all the Eclipse related files, and the projects can then be imported into Eclipse (for example via `File->Import->Existing Projects into Workspace`).

- Some of the projects may have errors. Fix any missing jars and add source folders to the build path as required.

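- Finally, add `M2_REPO` (pointing to the local Maven repository, typically `~/.m2/repository`) to the classpath variables in `Window->Preferences->Java->Build Path->Classpath Variables`.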

- The projects should now compile without any errors.

- Changes to the 0.23 branch are happening at a very fast pace. `svn up` will pull the latest code and `mvn clean install package -Pdist -Dtar -DskipTests` will compile the source code again. There is no need to run `mvn eclipse:eclipse` again.
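
The command-line steps above can be collected into a small shell helper. This is only a sketch, not part of the Cloudera article: the function names `version_ge`, `check_protoc`, `run` and `refresh_build` and the `DRY_RUN` variable are made up for illustration, and `sort -V` assumes GNU coreutils (as found on Ubuntu). The `svn`, `mvn` and `protoc` invocations are the ones quoted in this post.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are hypothetical, not from Hadoop.

# Minimum Protocol Buffers version recommended above.
REQUIRED_PROTOC="2.4.1"

# version_ge A B: succeeds when version A >= version B.
# Assumes GNU sort, which supports -V (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_protoc: verify that protoc is installed and recent enough.
check_protoc() {
    if ! command -v protoc >/dev/null 2>&1; then
        echo "protoc not found; build and install Protocol Buffers first"
        return 1
    fi
    # `protoc --version` prints a line such as "libprotoc 2.4.1".
    found=$(protoc --version | awk '{print $2}')
    if version_ge "$found" "$REQUIRED_PROTOC"; then
        echo "protoc $found is new enough"
    else
        echo "protoc $found is older than $REQUIRED_PROTOC"
        return 1
    fi
}

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# refresh_build MODE: "initial" builds and generates the Eclipse files;
# any other value updates the working copy and rebuilds only.
refresh_build() {
    mode=$1
    if [ "$mode" != "initial" ]; then
        run svn up || return 1
    fi
    run mvn clean install package -Pdist -Dtar -DskipTests || return 1
    if [ "$mode" = "initial" ]; then
        run mvn eclipse:eclipse || return 1
    fi
}
```

After sourcing the file from the root of the checked-out source tree, `check_protoc && refresh_build initial` would cover the first build, and `refresh_build update` the later `svn up` and rebuild cycle. Setting `DRY_RUN=1` first only prints the commands, which is a cheap way to check what would run.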

Time to learn more about the Hadoop code :)
