Monday, April 15, 2013

Getting the filename of the input block in Hadoop

Hadoop by default splits each input file into 64 MB blocks, and each split is processed by a separate mapper task. To gather metrics per file rather than across the entire set of input files, the mapper needs to know the name of the file it is reading. Here is how to get the file name of the split being processed, using both the old and the new MR API.
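
To make the block splitting described above concrete, here is a rough sketch of how many splits, and therefore mapper tasks, a single file produces. It assumes a splittable, non-empty file and a split size equal to the block size. The 10% slack on the last split is modelled on the split calculation in FileInputFormat and should be treated as an assumption, and the class and method names are made up for illustration.

```java
// Hypothetical helper for illustration; not part of the Hadoop API.
final class SplitCountSketch {
    // Assumed slack: the last split may be up to 10% larger than splitBytes.
    private static final double SPLIT_SLOP = 1.1;

    private SplitCountSketch() {}

    // Estimates the number of splits for one file of fileBytes bytes.
    static int countSplits(long fileBytes, long splitBytes) {
        if (fileBytes <= 0 || splitBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        int splits = 0;
        long remaining = fileBytes;
        // Carve off full-size splits while more than 1.1 split sizes remain.
        while ((double) remaining / splitBytes > SPLIT_SLOP) {
            splits++;
            remaining -= splitBytes;
        }
        // The remainder is always non-empty here and becomes the final split.
        return splits + 1;
    }
}
```

With 64 MB splits, a 200 MB file gives 4 splits under these assumptions, while a 70 MB file gives only 1 because it fits within the slack. Every mapper working on the same file sees the same file name.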

Using the old MR API

Add the below to the mapper class.
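String fileName = "";
public void configure(JobConf job)
{
   // "map.input.file" holds the path of the file behind the current split.
   // In Hadoop 2 and later the property is named "mapreduce.map.input.file" and the old name is deprecated.
   fileName = job.get("map.input.file");
}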

Using the new MR API

Add the below to the mapper class.
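String fileName = "";
// FileSplit here is org.apache.hadoop.mapreduce.lib.input.FileSplit; the cast assumes a file-based input format.
protected void setup(Context context) throws java.io.IOException, java.lang.InterruptedException
{
   fileName = ((FileSplit) context.getInputSplit()).getPath().toString();
}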
Now the String fileName can be used in the mapper code.
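
As a sketch of how fileName might feed per-file metrics: the value stored above is a full path (for example an hdfs:// URI), so it is often convenient to reduce it to the last path component before using it as a key. The names below (FileNameMetrics, baseName, countPerFile) are hypothetical, and the code is plain Java with no Hadoop dependency, so it only imitates what a mapper would do. Inside a real mapper, Path.getName() returns the same last component.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of the Hadoop API.
final class FileNameMetrics {
    private FileNameMetrics() {}

    // Returns the text after the last '/', or the input unchanged if there is no '/'.
    static String baseName(String splitPath) {
        int slash = splitPath.lastIndexOf('/');
        return slash < 0 ? splitPath : splitPath.substring(slash + 1);
    }

    // Counts entries per file, keyed by base name.
    // Each entry stands in for one record whose mapper had that path in fileName.
    static Map<String, Long> countPerFile(List<String> splitPaths) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String path : splitPaths) {
            counts.merge(baseName(path), 1L, Long::sum);
        }
        return counts;
    }
}
```

Note that two files with the same name in different directories collapse into one key here; keep the full path as the key if that can happen in the input set.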
