Big Data and Cloud Tips: 2011

Friday, December 30, 2011

WhatsWrong : Accumulating values sent to mapper

The below program accumulates the values sent to the mapper in valueBuff of type ArrayList<Text> (at line 21) and prints the ArrayList in the AccumulateKVMapper#cleanup method (at line 28).

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AccumulateKV {
    static class AccumulateKVMapper extends
            Mapper<LongWritable, Text, LongWritable, Text> {

        public ArrayList<Text> valueBuff = new ArrayList<Text>();

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            valueBuff.add(value);
            context.write(key, value);
        }

        protected void cleanup(Context context) throws IOException,
                InterruptedException {
            for (Text o : valueBuff) {
                System.out.println("value = " + o.toString());
            }
        }

    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err
                    .println("Usage: AccumulateKV <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(AccumulateKV.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(AccumulateKVMapper.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The input to the job is

Berlin-Germany
New Delhi-India
Kuala Lampur-Malaysia
Pretoria-Sri Lanka

The expected output is

value = Berlin-Germany
value = New Delhi-India
value = Kuala Lampur-Malaysia 
value = Pretoria-Sri Lanka

But, the output of the program is

value =
value =
value = 
value =

What's wrong with the above program? I have tried it against 0.22 release in Local (Standalone) Mode, but it should behave the same with other releases and also in Pseudo-Distributed and Fully-Distributed Mode.

Respond back in the comments and I will give a detailed explanation once I get a proper response.

Security in Hadoop

Hadoop security is often neglected and the information on the internet for securing a Hadoop cluster is also sparse.

It took me sometime to get around Hadoop security. So, included a separate page `Hadoop Security` with a quick summary and resources about Hadoop security at the top of the blog.

Will update `Hadoop Security` page as I get more comfortable and as more information is available on Hadoop security.

Sunday, December 25, 2011

Limiting the usage Counters in Hadoop

Besides the JobCounter and the TaskCounter counters which Hadoop framework maintains, it's also possible to define custom counters for application level statistics.

Counters can be incremented using the Reporter for the Old MapReduce API or by using the Context using the New MapReduce API. These counters are sent to the TaskTracker and the TaskTracker will send to the JobTracker and the JobTracker will consolidate the Counters to produce a holistic view for the complete Job.

There is a chance that a rogue Job creates millions of counters and since these counters are stored in the JobTracker, there is a better chance that JobTracker will go OOM. To avoid such a scenario the number of counters that can be created per Job are limited by the Hadoop framework.

The following code is in the Counters.java. Note that this code is in the 20.203, 20.204 and 20.205 (now called 1.0) releases. Also note that some of the parameters are configurable and some are not.

/** limit on the size of the name of the group **/
private static final int GROUP_NAME_LIMIT = 128;
/** limit on the size of the counter name **/
private static final int COUNTER_NAME_LIMIT = 64;

private static final JobConf conf = new JobConf();
/** limit on counters **/
public static int MAX_COUNTER_LIMIT = 
conf.getInt("mapreduce.job.counters.limit", 120);

/** the max groups allowed **/
static final int MAX_GROUP_LIMIT = 50;

In trunk and 0.23 release the below code is there in the MRJobConfig.java. Note that the parameters are configurable.

public static final String COUNTERS_MAX_KEY = "mapreduce.job.counters.max";
public static final int COUNTERS_MAX_DEFAULT = 120;

public static final String COUNTER_GROUP_NAME_MAX_KEY = "mapreduce.job.counters.group.name.max";
public static final int COUNTER_GROUP_NAME_MAX_DEFAULT = 128;

public static final String COUNTER_NAME_MAX_KEY = "mapreduce.job.counters.counter.name.max";
public static final int COUNTER_NAME_MAX_DEFAULT = 64;

public static final String COUNTER_GROUPS_MAX_KEY = "mapreduce.job.counters.groups.max";
public static final int COUNTER_GROUPS_MAX_DEFAULT = 50;

The above mentioned configuration parameters are not mentioned in the release documentation an so thought it would be worth mentioning as a blog entry.

I would like to call it a hidden jewel :)

Getting started with Hadoop in easy steps

For those who are interested in Hadoop, but are stuck with a Windows machine or have inertia to install Linux for different reasons, here are some alternatives.

a) Install Hadoop on Windows using Cygwin.

b) Wait for Microsoft to come out of the Hadoop preview.

c) Use Amazon EMR. Amazon has very clear documentation.

* d) Use a virtual image CDH from Cloudera which includes Hadoop.

We would be going through the last option of using CDH image from Cloudera in detail. CDH documentation can be found here.

The prerequisite for this is the installation of VirtualBox on the target machine. VirtualBox can be downloaded from here and here are the detailed instructions for installing VirtualBox. Also, an virtual image of CDH for VirtualBox has to be downloaded and unzipped. The downloaded file name would be similar to cloudera-demo-vm-cdh3u2-virtualbox.tar.gz.

Lets get started.

Step 1) Start VirtualBox and click on New (Ctrl-N).

Step 2) Click on Next.

Step 3) Select the options as below. Make sure to select `Red Hat (64 bit)` as CentOS is based on Red Hat and that the CDH image is 64 bit. Click on Next.

Step 4) Cloudera documentation says `This is a 64-bit image and requires a minimum of 1GB of RAM (any less and some services will not start). Allocating 2GB RAM to the VM is recommended in order to run some of the demos and tutorials`. Was able to run the Hadoop example with 1536 MB. Note that the memory allocated to the Guest OS can be changed later from within VirtualBox.

Step 6) Select `Use existing hard disk` and choose the CDH image unzipped earlier and click on Next.

Step 7) Make sure all the details are correct in the summary screen and click on Create.

Step 8) The VirtualBox home screen should look like below. Select CDH on the left pane and click on Start.

Step 9) After a couple of moments, the virtual image would have started along with all the daemons.

Step 10) To verify that the Hadoop daemons started properly, check that the number of TaskTracker nodes at http://localhost:50030/jobtracker.jsp and the number of DataNodes at http://localhost:50070/dfshealth.jsp is 1.

Also, the output of the `sudo /usr/java/default/bin/jps` command should be as below

2914 FlumeMaster
3099 Sqoop
2780 FlumeWatchdog
2239 JobTracker
2850 FlumeWatchdog
2919 FlumeNode
2468 SecondaryNameNode
2019 HMaster
3778 Jps
2145 DataNode
2360 NameNode
2964 RunJar
3076 Bootstrap
2568 TaskTracker

Step 11) Open a terminal and run the below command

hadoop --config $HOME/hadoop-conf jar /usr/lib/hadoop/hadoop-0.20.2-cdh3u2-examples.jar pi 10 10000

On the successful completion of the Job the output should look as below.

Note:

1) Some of the processors support Intel-VT and AMD-V for Hardware virtualization support. Most of the times this is disabled in the BIOS and has to be explicitly turned on.

2) Flume, Sqoop, Hive and Hue are also started besides the core Hadoop. These can be disabled by removing the x permissions for the files in /etc/init.d.

3) For purpose of stability/security the patches on CentOS can be updated using `sudo yum update` command.

4) For better performance and usability if the Guest (CentOS) VirtualBox Guest Additions have to be installed.

5) Hue (Web UI for Hadoop) can be accessed from http://localhost:8088/.

Friday, December 23, 2011

What is the relation between Hibernate and Swap Memory?

With the new HP 430 Notebook, hibernate was not working. Was getting a message that not enough swap for hibernate. Found from this Wiki that (swap memory >= RAM) for hibernate to work.

Since the HP 430 Netbook had enough RAM (4GB), I choose the swap to be 1GB at the time of Ubuntu installation and so hibernate was not working. Again the Wiki has instructions for increasing the size of the Swap.

So, it's better to choose enough swap at the time of the Ubuntu installation for hibernate to work.

New domain name for this blog

I own the www.thecloudavenue.com domain and had been using it for some other blog, which I had not been updating very actively. So, I am changed the domain name for this blog to use www.thecloudavenue.com.

Happy Hadoop'ing :)

Thursday, December 22, 2011

My new HP 430 Notebook

This week I bought a new HP 430 Notebook from Flipkart and am extremely satisfied with it. The Flipkart service was really awesome, would recommend to try it.

The Netbook was naked (without any OS), so I saved a couple of bucks on some proprietary OS. As soon as I got it, I installed Ubuntu 11.10 with all the updates. Since, I use apt-cacher-ng and maintain the list of software I use, setting up the machine was a breeze. A single `sudo apt-get install ........` installed all the required software.

Also, installed Java, Hadoop, Eclipse and the required software to try, learn and explore Hadoop. The Netbook has an i5 processor and supports Intel-VT, which had to be enabled in the BIOS. Also, installed Ubuntu 11.10 as guest using VirtualBox and tried a Hadoop cluster of 2 nodes (host and guest) with 0.22 release.

The regret I have is that I am able to suspend the Notebook in Ubuntu 11.10, but not able to resume it back. So, I have to hibernate it or shut it down. And though the Netbook has 4GB RAM, the graphics card doesn't have separate memory and is consuming it from the 4GB RAM.

Edit (24th January, 2012) - Noticed that the Notebook had a few bright spots and had to get it replaced. Although Flipkart had a 30 day replacement guarantee free of cost, the customer support got me in touch with the HP service center to get the Netbook monitor replaced instead of a full Notebook replacement, which was not what I wanted. Had to literally call Flipkart 15-20 times, send pictures of the Notebook monitor multiple times and finally had to escalate it to their supervisor before they agreed for a replacement of the Notebook.

Flipkart which is being called Amazon for India has a customer support which was behaving a lot cranky for the replacement of the laptop in spite of the damage. Flipkart is offereing competetive price with a COD (Cash On Delivery) and a 3 month EMI, so would be definitely recomendig to use the Flipkart service with care.

Edit (1st August, 2012) - Noticed that sometimes (very rare) the caps lock doesn't work and the network also doesn't work. Had to remove the battery from the laptop and put it back. Reboot doesn't work in both the cases. Looks like some state is stored in the laptop even when turned off, which gets reset when the battery is removed and put back.

Friday, December 16, 2011

What should be the input/ouput key/value types be for the combiner?

When only a map and reducer class are defined for a job, the key/value pairs emitted by the mapper are consumed by the by the reducer. So, the output types for the mapper should be the same as the reducer.

(input) <k1, v1> -> map -> <k2, v2> -> reduce -> <k3, v3> (output)

When a combiner class is defined for a job, the intermediate key value pairs are combined on the same node as the map task before sending to the reducer. Combiner reduces the network traffic between the mappers and the reducers.

Note that the combiner functionality is same as the reducer (to combine keys), but the combiner input/output key/value types should be of the same type, while for the reducer this is not a requirement.

(input) <k1, v1> -> map -> <k2, v2> -> combine* -> <k2, v2> -> reduce -> <k3, v3> (output)

In the scenario where the reducer class is also defined as a combiner class, the combiner/reducer input/ouput key/value types should be of the same type (k2/v2) as below. If not, due to type erasure the program compiles properly but gives a run time error.

(input) <k1, v1> -> map -> <k2, v2> -> combine* -> <k2, v2> -> reduce -> <k2, v2> (output)

Thursday, December 15, 2011

What is the difference between the old and the new MR API?

With the release of Hadoop 0.20, new MR API has been introduced (o.a.h.mapreduce package). There is not much of significant differences between the old MR API (o.a.h.mapred) and the new MR API (o.a.h.mapred) except that the new MR API allows to pull data from within the Map and Reduce tasks by calling the nextKeyValue() on the Context object passed to the Map function.

Also, some of the InputFormats have not been ported to the new MR API. So, to use the missing InputFormat either stop using the new MR API and go back to the old MR API or else extend the InputFormat as required.

How to start CheckPoint NameNode in Hadoop 0.23 release?

Prior to 0.23 release the masters file in the conf folder of Hadoop installation had the list of host names on which the CheckPoint NN has to be started. But, with the 0.23 release the masters file is not used anymore, the dfs.namenode.secondary.http-address key has to be set to ip:port in hdfs-site.xml. CheckPoint NN can be started using the sbin/hadoop-daemon.sh start secondarynamenode command. Run jps command to make sure that the CheckPoint NN is running and also check the corresponding log file also for any errors.

BTW, Secondary NN is being referred to as CheckPoint NN. But, the code is still using Secondary NN and people still refer it as Secondary NN.

Tuesday, December 13, 2011

Hadoop and JavaScript

Microsoft has announced limited preview version of Hadoop on Azure. JavaScript can also be used to write MapReduce jobs on Hadoop. As of now, Streaming allows any scripting language which can read/write from STDIN/STDOUT to be used with Hadoop. But, what Microsoft is trying make is make JavaScript a first class citizen for Hadoop. There is a session on `Hadoop + JavaScript: what we learned' end of February, which is too long for the impatient. BTW, here is an interesting article on using JavaScript on Hadoop with Rhino.

There had been a lot of work on JavaScript in the browser area for the last few years to improve the performance (especially V8).

Can anyone share use-cases or their experience using JavaScript for HPC (High Performance Computing) in the comments and I will update the blog entry accordingly?

Thursday, December 8, 2011

BigData vs Databases

Anant explains in an easy to understand manner. Here is his comparison of the Big Data and Databases. Suggest to subscribe his blog.

Monday, December 5, 2011

HDFS explained as comics

Manish has done a nice job explaining HDFS as comics for those who are new to HDFS. Here is the link.

Wednesday, November 30, 2011

Passing parameters to Mappers and Reducers

There might be a requirement to pass additional parameters to the mapper and reducers, besides the the inputs which they process. Lets say we are interested in Matrix multiplication and there are multiple ways/algorithms of doing it. We could send an input parameter to the mapper and reducers, based on which the appropriate way/algorithm is picked. There are multiple ways of doing this

Setting the parameter:

1. Use the -D command line option to set the parameter while running the job.

2. Before launching the job using the old MR API

JobConf job = (JobConf) getConf();
job.set("test", "123");

3. Before launching the job using the new MR API

Configuration conf = new Configuration();
conf.set("test", "123");
Job job = new Job(conf);

Getting the parameter:

1. Using the old API in the Mapper and Reducer. The JobConfigurable#configure has to be implemented in the Mapper and Reducer class.

private static Long N;
public void configure(JobConf job) {
    N = Long.parseLong(job.get("test"));
}

The variable N can then be used with the map and reduce functions.

2. Using the new API in the Mapper and Reducer. The context is passed to the setup, map, reduce and cleanup functions.

Configuration conf = context.getConfiguration();
String param = conf.get("test");

Tuesday, November 29, 2011

0.23 Project Structure

Prior to 0.23 release the project structure was very simple (common, hdfs and mapreduce). With 0.23 a couple of projects have been added for the ResourceManager, NodeManager, ApplicationMaster etc. Here is the structure of the 0.23 release drawn using Freemind.

Thursday, November 24, 2011

HDFS Name Node High Availability

NameNode is the heart of HDFS. It stores the namespace for the filesystem and also tracks the location of the blocks in the the cluster. The location of the blocks are not persisted in the NameNode, but the DataNodes report the blocks it has to the NameNode when the DataNode starts. If an instance of NameNode is not available, then HDFS is not accessible till it's back running.

Hadoop 0.23 release introduced HDFS federation where it is possible to have multiple independent NameNodes in a cluster, where in a particular DataNode can have blocks for more than one Name Node. Federation provides horizontal scalability, better performance and isolation.

HDFS NN HA (NameNode High Availability) is an area where active work is happening. Here are the JIRA, Presentation and Video for the same. HDFS NN HA was not cut into 0.23 release and will be part of later releases. Changes are going in the HDFS-1623 branch, if someone is interested in the code.

Edit (11th March, 2012) : Detailed blog entry from Cloudera on HA.

Edit (14th October, 2013) : Explanation on HA from HortonWorks.

Browsing the MRv2 code in Eclipse

MRv2 is a revamp of the MapReduce engine for making Hadoop reliable, available, scalable and also for better cluster utilization. MRv2 development had been under active development for some time and alpha are being released now. The Cloudera article is very useful for getting the code from SVN, building, deploying and finally running a Job to make sure Hadoop has been setup properly. Here are some additional steps

- Protocol Buffers is used as an RPC protocol between different daemons. The recomendation is to use Protocol Buffer version 2.4.1+. Some of the Linux releases don't have this version. So, Protocol Buffers code has to be downloaded, built and installed using the `configure, make, make install` command. `make install` will require administrative privileges. `protoc --version` will give the version number.

- In Ubuntu 11.10, g++ was not installed by default. `sudo apt-get install g++` installed the required binaries.

- As mentioned in the article, git can be used to get the source code. But, git pulls the entire Hadoop repository. Code for a specific version can also be pulled using the command `svn co http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/ source/`.

- Once the code has been successfully compiled using the `mvn clean install package -Pdist -Dtar -DskipTests` command, `mvn eclipse:eclipse` will build all the Eclipse related files and the projects can be imported as:

- Some of the projects may have errors, fix any missing jars and add source folders to the build path as required.

- Finally, add MR_REPO to the CLASSPATH variables in `Windows->Preferences`

- Now the projects should be compiled without any errors.

- Changes to the 0.23 branch are happening at a very fast pace. `svn up` will pull the latest code and `mvn clean install package -Pdist -Dtar -DskipTests` will compile the source code again. There is no need for a `mvn eclipse:eclipse` again.

Time to learn more about the Hadoop code :)

Tuesday, November 22, 2011

Interaction between the JobTracker, TaskTracker and the Scheduler

Scheduler in Hadoop is for sharing the cluster between different jobs, users for better utilization of the cluster resources. Also, without a scheduler a Hadoop job might consume all the resources in the cluster and other jobs have to wait for it to complete. With scheduler jobs can execute in parallel consuming a part of cluster.

Hadoop has a pluggable interface for schedulers. All implementations of the scheduler should extend the abstract class TaskScheduler and the scheduler class should be specified in the `mapreduce.jobtracker.taskscheduler` property (defaults to org.apache.hadoop.mapred.JobQueueTaskScheduler). The Capacity Scheduler, Fair Scheduler and other scheduler implementations are shipped with Hadoop.

The TaskTracker sends a heart beat (TaskTracker#transmitHeartBeat) to the JobTracker at regular intervals, in the heart beat it also indicates that it can take new tasks for execution. Then the JobTracker (JobTracker#heartbeat) consults the Scheduler (TaskScheduler#assignTasks) to assign tasks to the TaskTracker and sends the list of tasks as part of the HeartbeatResponse to the TaskTracker.

Sunday, November 20, 2011

Google Blogger Statistics SUCK !!!

In Google Blogger Statistics, only the top 10 referring sites are shown as below. As bots take over and with referral spam this feature is completely useless. Google should do a better job of handling the Statistics feature.

Checking the server log files for statistics of a self hosted blog is also a bit difficult because of the spam. It would be a good to have `reliable regularly updated publicly accessible list of spam sites`, which can be integrated with a script to get useful data out of the server logs.

Edit: Looks like Google Analytics is doing a good job of filtering the bots.

Thursday, November 17, 2011

Why to explicitly specify the Map/Reduce output parameters using JobConf?

Some thing really simple ....

Q) Why to explicitly specify the Map/Reduce output parameters using JobConf#setMapOutputKeyClass, JobConf#setMapOutputValueClass, JobConf#setOutputKeyClass and JobfConf#setOutputValueClass methods, can't Hadoop framework deduce them from the Mapper/Reducer implementations using Reflection?

A) This is due to Type Erasure :

When a generic type is instantiated, the compiler translates those types by a technique called type erasure — a process where the compiler removes all information related to type parameters and type arguments within a class or method. Type erasure enables Java applications that use generics to maintain binary compatibility with Java libraries and applications that were created before generics.

Nice to learn new things !!!

Wednesday, November 16, 2011

Hadoop release / version numbers

Edit: For easier access I have moved this to the pages section just below the blog header and no more maintaining this entry.

Software release numbers and features are daunting, remember Windows 1.0, 2.0, 2.1x, 3.0, 3.1, 95, 98, Me, NT WorkStation, 2000, XP, Vista, 7, 8 etc (I might have missed some of them). Microsoft seems to learning lately a bit with Windows 7, 8 naming. Ubuntu has a nice release scheme. The current release is 11.10 (the 1st number is the year and the 2nd number is the month of release), which says that it was released on October, 2011 and the next release number will be 12.04 (sometime around April, 2012). Ubuntu also has also clear guidelines on how long they would be supporting each version of Ubuntu.

Coming to Hadoop, there are multiple releases (0.20.*, 0.21, 0.22, 0.23 etc) and À la carte of features available in each of those release (CHANGES.txt in the release will have the JIRA's that have been pulled into that release) and users of Hadoop are confused on what release to pick. Some of these release are stable and some of them aren't. There is a lenghty discussion going in the Hadoop Groups to make the release numbers easy for everyone. Currently, 0.20.security, 0.22 and 0.23 are the releases on which work is happening actively. Proposal is to call them release 1, 2 and 3 for the coming releases, but it has yet to be finalized. 0.23 has been released recently, but is not production ready yet.

Besides the other improvements releated to HDFS, here is how the old/new MR API and Engine are supported in the different releases of Hadoop.

	Old API	New API	Class MR Engine	New MR Engine-MRv2
0.20.X	Y	Y	Y	N
0.22	Y	Y	Y	N
0.23	Y(deprecated)	Y	N	Y

Tuesday, November 15, 2011

IPC between Hadoop Daemons

Sunday, November 13, 2011

Seamless access to the Java SE API documentation

API documentation for Java SE and Hadoop (and other frameworks) can be downloaded for offline access. But, the Hadoop API documentation is not aware of the offline copy of Java SE documentation.

For seamless interaction between the two API, reference to

http://java.sun.com/....../ConcurrentLinkedQueue.html

in the Hadoop API should be replaced with

file:///home/praveensripati/....../ConcurrentLinkedQueue.html

The below command will replace all such references in the API documentation (note that back slash has to be escaped)

find ./ -type f -name *.html -exec sed -i 's/http:\/\/java.sun.com\/javase\/6\/docs\//file:\/\/\/home\/praveensripati\/Documents\/Java6API\//' {} \;

This enables for seamless offline access and better productivity.

Saturday, November 12, 2011

PDF Reader with Annoations - Okular

For those who store much information digitally and less on paper, Okular is a nice software to create annotations in PDF documents. Okular runs on multiple platforms, but I had been using it on Ubuntu for some time and quite happy with it.

'sudo apt-get install okular' will install Okular on Ubuntu. The annotations are stored in the $HOME/.kde/share/apps/okular/docdata folder as a separate file for each pdf document. By creating a symlink for this folder in DropBox, the annotations are automatically backed up.

This is no way related to Hadoop, but thought of blogging it. BTW, there a quite a number of free and very useful software like GIMP for Linux.

Thursday, November 10, 2011

Retrieving Hadoop Counters in Map/Reduce Tasks

Hadoop uses Counters to gather metrics/statistics which can later be analyzed for performance tuning or to find bugs in the MapReduce programs. There are some predefined Counters and Custom counters can also be defined. JobCounter and TaskCounter contain the predefined Counters in Hadoop. There are lot of tutorials on incrementing the Counters from the Map and Reduce tasks. But, how to fetch the current value of the Counter from with the Map and Reduce tasks.

Counters can be incremented using the Reporter for the Old MapReduce API or by using the Context using the New MapReduce API. These counters are sent to the TaskTracker and the TaskTracker will send to the JobTracker and the JobTracker will consolidate the Counters to produce a holistic view for the complete Job. The consolidated Counters are not relayed back to the Map and the Reduce tasks by the JobTracker. So, the Map and Reduce tasks have to contact the JobTracker to get the current value of the Counter.

StackOverflow Query has the details on how to get the current value of a Counter from within a Map and Reduce task.

Edit: Looks like it's not a good practice to retrieve the counters in the map and reduce tasks. Here is an alternate approach for passing the summary details from the mapper to the reducer. This approach requires some effort to code, but is doable. It would have been nice if the feature had been part of Hadoop and not required to hand code it.

Friday, November 4, 2011

Hadoop MapReduce challenges in the Enterprise

Platform Computing published a five part series (one, two, three, four, five) about the Hadoop MapReduce Challenges in the Enterprise. Some of the challenges mentioned in the Series are addressed by the NextGen MapReduce which will be available soon for download, but some of the allegations are not proper. Platform has got products around MapReduce and is about to be acquired by IBM, so not sure how they got them wrong.

Platform) On the performance measure, to be most useful in a robust enterprise environment a MapReduce job should take sub-millisecond to start, but the job startup time in the current open source MapReduce implementation is measured in seconds.

Praveen) MapReduce is supposed to be for batch processing and not for online transactions. The data from a MapReduce Job can be fed to a system for online processing. It's not to say that there is no scope for improvement in the MapReduce job performance.

Platform) The current Hadoop MapReduce implementation does not provide such capabilities. As a result, for each MapReduce job, a customer has to assign a dedicated cluster to run that particular application, one at a time.

Platform) Each cluster is dedicated to a single MapReduce application so if a user has multiple applications, s/he has to run them in serial on that same resource or buy another cluster for the additional application.

Praveen) Apache Hadoop had a pluggable Scheduler architecture and has Capacity, Fair and FIFO Scheduler. The FIFO Scheduler is the default one. Schedulers allow multiple applications and multiple users to share the cluster at the same time.

Platform) Current Hadoop MapReduce implementations derived from open source are not equipped to address the dynamic resource allocation required by various applications.

Platform) Customers also need support for workloads that may have different characteristics or are written in different programming languages. For instance, some of those applications could be data intensive such as MapReduce applications written in Java, some could be CPU intensive such as Monte Carlo simulations which are often written in C++ -- a runtime engine must be designed to support both simultaneously.

Praveen) NextGen MapReduce allows for dynamic allocation of resources. Currently there is only support for RAM based requests, but the framework can be extended for other parameters like CPU, HDD etc in the future.

Platform) As mentioned in part 2 of this blog series, the single job tracker in the current Hadoop implementation is not separated from the resource manager, so as a result, the job tracker does not provide sufficient resource management functionalities to allow dynamic lending and borrowing of available IT resources.

Praveen) NextGen MapReduce separates resource management and task scheduling into separate components.

To summarize, NextGen MapReduce addresses some of the concerns raised by Platform, but it will take some time for NextGen MapReduce to get stabilized and be production ready.

Wednesday, November 2, 2011

Hadoop Jar Hell

It's just not possible to download the latest Hadoop related Projects from Apache and use them together because of the interoperability issues among the different Hadoop Projects and their release cycles.

That's the reason why BigTop an Apache Incubator project has evolved, to solve the interoperability issues around the different Hadoop projects by providing a test suite. Also, companies like Cloudera provide their own distribution with different Hadoop projects based on Apache distribution, with proper testing and support.

Now HortonWorks which has been spun from Yahoo joined the same ranks. Their initial manifesto was to make the Apache downloads a source where anyone can download the jars and use them without any issues. But, they have moved away from this with the recent announcement of the HortonWorks Data Platform which is again based on Apache distribution similar to what Cloudera has done with their CDH distributions. Although, HortonWorks and Cloudera have their own distribution, they would be actively contributing the Apache Hadoop ecosystem.

With the maturity of BigTop it would be possible to download different Hadoop related jar files from Apache and use them directly instead of depending on the distributions from HortonWorks and Cloudera.

As mentioned in the GigaOm Article, such distributions from HortonWorks and Cloudera make them easy to support their customers as they have to support limited number of Hadoop versions and they would also know the potential issues with those versions.

Friday, October 28, 2011

Hadoop and MapReduce Algorithms Academic Papers

Hadoop in spite of starting in a web based company (Yahoo), has spawned to solve problems in many other disciplines. Amund has consolidated list of 'Hadoop and MapReduce Algorithms Academic Papers'. This will give an idea where all Hadoop and MapReduce can be used, some of them can be ideas for projects also. The Atbrox Blog is maintaining this list.

Nice blogs on Hadoop and Distributed Computing

Here are some of the blogs I had been following to keep myself updated with the latest happening around Hadoop and Distributed Computing. If I missed any blogs, please add them in the comments and I will add them to the list.

http://www.allthingsdistributed.com/

http://allthingshadoop.com/

http://www.asterdata.com/blog/

http://atbrox.com/

http://www.cloudera.com/blog/

http://developer.yahoo.com/blogs/hadoop/

http://hadoopblog.blogspot.com/

http://www.hortonworks.com/blog/

http://www.mapr.com/blog/

http://www.michael-noll.com/blog/

http://platformcomputing.blogspot.com/

http://steveloughran.blogspot.com/

http://tdunning.blogspot.com/

Thursday, October 27, 2011

'Hadoop - The Definitive Guide' Book Review

For those who are interested and serious in getting into Hadoop, besides going through the tons of articles and tutorials on the Internet, 'Hadoop - The Definitive Guide' (2nd Edition) by Tom White is a must have book. Most of the tutorials stop with the 'Word Count' example, but this book goes into the next level explaining the nuts-n-bolts of the Hadoop framework with a lot of examples and references. The most interesting and important thing is that the book also mentions why certain design decisions where made in Hadoop.

Not only the book covers HDFS and MapReduce, but also gives an overview of the layers which sit on top of Hadoop like Pig, Hive, HBase, ZooKeeper and Sqoop.

The book could definitely have the following

MapReduce is covered in detail, but HDFS internals and fine-tuning are at a high-level.
Also, to be in sync with Hadoop development and features, it's absolutely necessary to get source from trunk or from another branch and build, package and try it out.
NextGen MapReduce, HDFS Federation and a slew of other features which are being released as part of Hadoop Release 0.23.

The 3rd Edition of same book is due on April 30th, 2012 and looks like it has more case studies as well as new material on MRv2. The 3rd Edition of the book is worth waiting, but for the impatient who want to get started immediately the 2nd Edition is a must have.

Saturday, October 22, 2011

Hadoop on Windows

As some of you might have read HortonWorks and Microsoft have partnered, to get Hadoop running on Windows. Till date, Hadoop is being run only on Linux in production, but on Windows and Linux for development. In the future, we would also be seeing Hadoop on Windows in production.

- It's not the first time Hadoop and Microsoft came together. Microsoft acquired semantic search engine PowerSet, which is now part of the Bing search engine. PowerSet internally used Hadoop. Later read that, Hadoop has been replaced with some other software by Microsoft after acquisition (disclaimer : not 100% sure about it).

- Then there is Dryad (platform for distributed computing) and DryadLINQ (high level abstraction language for distributed computing) from Microsoft. DryadLINQ is tightly integrated with .NET and Windows and would be running much more efficiently on Windows than Hadoop on Windows. Not sure, if Microsoft will give enough focus on Hadoop along with Dryad.

- Apache Hadoop documentation recommends Oracle JDK 6. Apache Hadoop unpatched doesn't run neither on IBM JDK/defunct Apache Harmony/Open JDK, now Windows is being added to the mix.

- I am not a performance expert on cross-platform applications, but it might be a challenge to make same version of Hadoop perform better on Linux and Windows at the same time.

- The one good thing about all of this is Microsoft would be contributing the code back to Apache and there would be more eyes looking at the Hadoop code. Also, Microsoft is having it's employees to work on Hadoop and not outsource it.

- Also, as Steve mentioned there is a very little chance that Hadoop on Windows will be deployed for internal use. So, someone outside has to step-up and deploy/find bugs in Hadoop on a big cluster.

Considering all there factors, let's wait and see if there would be more Hadoop on Linux or Windows.

Edit: I was a bit skeptical about Microsoft's commit to Hadoop. But, looks like Microsoft is jumping into Hadoop all the way. This is a good news for Hadoop.

Edit (13th December, 2011) : Microsoft to allow limited preview on Hadoop on Azure.

Edit (15th December, 2011): The following url point to WIP documentation for Hadoop on Windows.

Edit (12th Jaunary, 2012): Avkash had been passionately blogging about Hadoop on Windows Azure.

Oh My God - Ubuntu !!!!

I have been using Ubuntu for more than 3 years and had been actively using it for the last 1 year, since I am working more and more on Hadoop. Ubuntu was supposed to be the flavor of Linux which is easy to install and runs out-of-the-box with minimum additional softwares/configurations required. Some of the features which have been there in 11.04 have been missing in 11.10. Here are some of the gripes I have, these may not be significant for a Linux geek, but may be significant for someone who wants to get started with Ubuntu for the first time.

- Screen saver has been removed in 11.10, there is a blue-print (smile) for a new screen saver. When something is not ready for 11.10, then why remove the existing working software. To get the screen saver running in 11.10, a couple of softwares had to be installed/removed and an application had to added to the start-up.

- User passwords can be changed from UI, but to make a user member of a group the terminal has to be used. Reason to use the CLI.

- After installing 11.10, the machine was a bit slower than 11.04. So, I used 'Startup Applications' to see the applications bought during boot. It did have only one in the list, which didn't make much sense. Later came to know that the startup applications were picked from /etc/xdg/autostart application and these are not shown in the 'Startup Applications'. Another reason to use the CLI.

- There is no easy way to add/change the order of icons to the Unity Launcher. A couple of files have to be tweaked to get the 'quick list' in the launcher.

- With GNome 2x, launching a new application instance was just a matter of clicking the icons in the panel. With Unity, it takes two clicks to launch a new application instance. This reminds me of something which took something like 6-7 clicks in Windows XP to change something really simple.

- There are some nice features like integration with Open Stack. But, it's not worth a dime to someone who wants to do some basic stuff like chatting, browsing etc.

I am not against CLI, in fact I write shell scripts to automate tasks on a regular basis. But, I find it disturbing that some of the basic features have been removed from Ubuntu 11.10 and it takes more clicks to perform some action in Unity than in GNome 2.

Hope 12.04 is better !!!

Friday, September 30, 2011

Resources for NextGen MapReduce

Edit: For easier access I have moved this to the pages section just below the blog header and no more maintaining this entry.

'Next Genereation MR' or 'NextGen MR' or 'MRv2' or 'MR2' is a major revamp of the MapReduce engine and will part of the 0.23 release. MRv1 or the old MapReduce engine will be not be supported in 0.23 release. The underlying engine has been revamped in 0.23, but the API to interface with the engine remains the same. So, the existing MapReduce code for MRv1 engine should run without modifications on MRv2.

The architecture, information for building and running MRv2 is spread across and this blog entry will try to consolidate and present all the information available on MRv2. I will keep-on updating this blog entry as I get more information about MRv2, instead of creating a new one. So, bookmark this and check it often :).

Current Status

http://www.hortonworks.com/update-on-apache-hadoop-0-23/ - 27th September, 2011

http://www.cloudera.com/blog/2011/11/apache-hadoop-0-23-0-has-been-released/ - 15th November, 2011

http://hortonworks.com/apache-hadoop-is-here/ - 16th November, 2011

Home Page

http://hadoop.apache.org/common/docs/r0.23.0/

Architecture

The Hadoop Map-Reduce Capacity Scheduler

The Next Generation of Apache Hadoop MapReduce

Next Generation of Apache Hadoop MapReduce – The Scheduler

Detailed document on MRv2

Presentation

Quick view of MRv2

JIRAs

https://issues.apache.org/jira/browse/MAPREDUCE-279

Videos

Next Generation Hadoop MapReduce by Arun C. Murthy

Code

http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/

Building from code and executing and running a sample

http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/BUILDING.txt
http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.23/hadoop-mapreduce-project/INSTALL

Thursday, September 29, 2011

QGit - GUI for Git

Once the remote Apache Hadoop repository has been cloned locally, now it's time to use a GUI to walk through the development history of the trunk and the different branches/tags. Gitk (which comes with the installation of Git), EGit (Eclipse Plugin for Git) and QGit are some of the UIs for Git. QGit is more intuitive than others. We will go through the steps for installing, configuring and using QGit.

Installing QGit

QGit doesn't come with the default installation of Ubuntu and can be installed from the command line or using 'Synaptic Package Manager'. Root permissions are required to install QGit.

sudo apt-get install qgit

will install QGit. QGit can be started from the menu (Applications -> Programming -> qgit) or using 'qgit' command from the command line.

Configuring and using QGit

Checkout the branch for which the history has to be viewed in QGit using the 'git checkout <branch>' command. The output of the 'git branch' command shows the selected branch with a *.

To view the Git repository in QGit

- Start QGit from the menu or command line.
- Goto 'Edit -> Setting' and unselect the 'Show range select when opening a repository' and 'Reopen last repository on startup' options.

- Select 'File -> Open .....' and select the hadoop-common folder which was created with 'git clone git://git.apache.org/hadoop-common.git' command.

- Select 'View -> Toggle tree view' to show the 'Git tree' on the left side. The 'Git tree' pane on the left shows you the state of the tree at the commit that has been selected in the 'Rev list' pane.

- Navigate to a file in that tree and double click on it to see the content of the file at that version and an indication of who most recently changed each line before that version.

This covers the basics for viewing the development history of a specific branch in QGit.

Happy Gitting :)

Viewing Hadoop code in Git

Git is an open-source distributed version control system. Distributed means that the developer would be getting a local copy of the entire development history along with the code (it's like cloning an entire remote repository locally). Because of the local data, Git is much faster compared to other version control systems. With Subversion and other version control systems, only the source code will be copied locally and the central server has to be contacted for the development history.

Some of the advantages of Git are

1) working offline when there is limited or no network connection

2) browsing through the source code without explicitly downloading the code for a trunk or a branch/tag
3) checking the history of the trunk/tags/branches offline

4) when making a lot of changes to the Hadoop code and to keep them under version control locally

Hadoop is evolving at a very rapid pace and it would be beneficial to configure Git with Hadoop. We will go through the instructions for (1) and (2) and leave (3) and (4) for a later blog entry. The below instructions are for Ubuntu 11.04 and might slightly differ for other OS.

Installing Git

Git is not installed by default in Ubuntu 11.04 and can be installed from the command line or using 'Synaptic Package Manager'. Root permissions are required to install Git.

sudo apt-git install git

will install git.

'which git' will give the location of Git and also ensure that Git is installed properly.

Getting the code from Apache

Apache provides read-only Git mirrors of the Apache codebases. So, the code cannot be submitted through the Git mirrors to Apache.

git clone git://git.apache.org/hadoop-common.git

will copy the code along with the development history locally. A directory 'hadoop-common/.git' will be created which will contain all the information for Git to function.

'git clone' command will currently download about 150 MB of files and will take time based on the network bandwidth. This is a one-time download and all the updates to the code will be much faster.

All the commands from now on should be run from hadoop-common directory or one of it's sub-directory.

How to know the current branch?

git branch

will list all the local branches and current working branch will have a star beside it. By default, the 'trunk' is the current working branch.

Viewing the code for a particular branch

git checkout branch-0.23

will switch to the new branch context. Means, the code in the directory will be for the branch specified in the 'git checkout' command.

Getting the latest code from Apache

git pull

will fetch the data from the remote repository and merge it locally for the current branch you are in (the one with a * in the output of the 'git branch' command).

for i in `git branch | sed 's/^.//'`; do git checkout $i ; git pull; done

will fetch and merge for all the local branches ('git branch' command will give list of all the local branches).

In the coming blog entry, we will go through the development history of Hadoop using Git.

References

Git Home Page

Basic information about some of the Git commands

Git and Hadoop Wiki

Introduction to Git with Scott Chacon of GitHub - Scott (author of Pro Git) seems to be in a hurry in the video :)

Wednesday, September 28, 2011

Theme of the blog

Hadoop is an open-source Java framework for distributed processing of large amounts of data. Hadoop is based on MapReduce programming model published by Google. As you browse through the web, there is a better chance that you are touching Hadoop or MapReduce model in some way.

The beauty of open-source is that the framework is open for anyone to use and modify it to their own requirement with enough commitment. But, the big challenge in adopting Hadoop or in fact any other open-source framework is the lack of documentation. Even if it's there, it might be sparse or stale. And, sometimes no documentation is better than incorrect or outdated documentation.

As I journey through learning Hadoop, I will blog tips and tricks for the development, debugging and usage of Hadoop. Feel free to comment on the blog entries for any corrections or any better way of doing things

Pages