Wednesday, November 30, 2011

Passing parameters to Mappers and Reducers

There might be a requirement to pass additional parameters to the mappers and reducers, besides the inputs which they process. Let's say we are interested in matrix multiplication and there are multiple algorithms for doing it. We could send an input parameter to the mappers and reducers, based on which the appropriate algorithm is picked. There are multiple ways of doing this.

Setting the parameter:

1. Use the -D command line option to set the parameter while running the job.
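
For the -D option to actually reach the job, the driver has to let Hadoop parse the generic options, which usually means implementing the Tool interface and launching through ToolRunner. The invocation can then be sketched as below (the jar, class and paths are hypothetical):

```shell
# Generic options such as -D must come before the application's own arguments.
hadoop jar my-job.jar com.example.MyDriver -D test=123 /input /output
```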

2. Before launching the job using the old MR API

// getConf() returns a Configuration; wrap it in a JobConf for the old API
JobConf job = new JobConf(getConf());
job.set("test", "123");

3. Before launching the job using the new MR API

Configuration conf = new Configuration();
// set parameters before creating the Job, as the Job makes its own copy of the Configuration
conf.set("test", "123");
Job job = new Job(conf);

Getting the parameter:

1. Using the old API in the Mapper and Reducer. The JobConfigurable#configure method has to be implemented in the Mapper and Reducer classes.

private long N;

public void configure(JobConf job) {
    // getLong returns the supplied default when "test" is not set
    N = job.getLong("test", 0);
}

The variable N can then be used with the map and reduce functions.
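
Parsing job.get("test") by hand would throw a NullPointerException if the parameter was never set; both JobConf and Configuration also offer typed getters such as getLong(name, defaultValue) that fall back to a default instead. The defaulting behaviour can be sketched in plain Java, with java.util.Properties as a stand-in for the string-keyed configuration (an illustration only, not Hadoop's exact semantics):

```java
import java.util.Properties;

public class TypedGet {
    // Stand-in for Configuration#getLong(name, defaultValue): returns the
    // default when the key is absent, otherwise parses the stored string.
    public static long getLong(Properties props, String name, long defaultValue) {
        String raw = props.getProperty(name);
        if (raw == null) {
            return defaultValue;
        }
        return Long.parseLong(raw.trim());
    }
}
```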

2. Using the new API in the Mapper and Reducer. A Context object is passed to the setup, map, reduce and cleanup methods, so the parameter can be read in setup:

protected void setup(Context context) {
    Configuration conf = context.getConfiguration();
    String param = conf.get("test");
}
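
Whichever API is used, only strings travel through the job configuration. When a list of values has to be passed (a question that also comes up in the comments below), one simple approach is to pack the list into a single delimited string on the driver side and split it back in configure()/setup(). A minimal sketch with hypothetical helper names; it assumes the values themselves contain no commas:

```java
import java.util.Arrays;
import java.util.List;

public class ParamCodec {
    // Pack a list of values into one configuration-friendly string.
    public static String encode(List<String> values) {
        return String.join(",", values);
    }

    // Split the packed string back into a list, e.g. inside setup().
    public static List<String> decode(String packed) {
        return Arrays.asList(packed.split(","));
    }
}
```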

16 comments:

  1. Perfect, couldn't be easier. I was up and running with mappers and reducers taking parameters from the JobConf class in minutes with this information. Thanks!

    1. Tim - Thanks for the response. I debated whether to write this blog entry or not, because it was very trivial. But it got the most hits :)

      When I started with Hadoop, I found that changes were happening at a very fast pace and sometimes I got off on the wrong foot; hence this blog.

      Hope you find the other entries here also helpful.

  2. Praveen : Is there any means by which I can pass certain parameters from main to the partitioner function (my custom partitioner) ?

    1. Arun,

      One hack is to write the parameters to a file in HDFS and read them in the custom partitioner. I don't like this approach; there might be better ways of solving it.

      Post the query in the Apache forums for a better response.

      Praveen

  3. Any idea on how I can pass an ArrayList to the mapper? The very inefficient workaround I can think of is converting it to a String. Also, could you suggest how I can pass an ArrayList to the driver method?
    Thank you!

    1. Write the data into HDFS (if the data is huge) and read it in the setup() of the mapper and reducer as required. Another option is to send the ArrayList as a String (if the data is small).

      There might be some better ways, which I am not aware of.

  4. Thank you very much! You solved my problem. ^^ Thankyou Thankyou~

  5. Dear Praveen,

    Thanks for your post, what would you do if you have many parameters? Is there a way to put the parameters in a settings file and make them available to the mapper/reducer?

    1. Dieter,

      I think you should make use of the DistributedCache in case you have multiple parameters to be passed on to the mapper/reducer.

      Check this:
      http://hadoop.apache.org/docs/stable/mapred_tutorial.html#DistributedCache

  6. Thanks for the post! Only the last solution worked for me in the new API. I would add that using the getInt/setInt methods would be slightly more efficient.

  7. It is important to know that the Configuration object is cloned at some point, so the order matters, i.e.:

    Configuration conf = getConf();
    conf.set("mmsilist", mmsiList);
    conf.set("msgidlist", msgidList); // set BEFORE the Job instance is created
    Job job = new Job(conf, "MyJob");

    // Calling conf.set("mmsilist", mmsiList) after the Job is instantiated will not work.

  8. This comment has been removed by the author.

  9. Hey, Thank you for your post, however I'm having problems. I'm using hadoop version 0.20.205, but context.getConfiguration(), java says context cannot be resolved. Is there a particular library I should be using? Is there a different variable I need to initialize first?

    Thanks!

  10. magnificent publish, very informative. I wonder why the other specialists of this sector don’t notice this. You should continue your writing. I’m confident, you’ve a great readers’ base already!.
    Hadoop Training in hyderabad
