Sunday, December 25, 2011

Limiting the usage of Counters in Hadoop

Besides the JobCounter and TaskCounter counters that the Hadoop framework maintains, it's also possible to define custom counters for application-level statistics.

Counters can be incremented using the Reporter in the old MapReduce API or the Context in the new MapReduce API. These counters are sent to the TaskTracker, the TaskTracker forwards them to the JobTracker, and the JobTracker consolidates the counters to produce a holistic view for the complete job.
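As a plain-Java sketch of this consolidation (no Hadoop dependency; the enum name and per-task values here are illustrative, not from a real job): in the real API a task would call context.getCounter(MyCounter.BAD_RECORDS).increment(1) in the new API, or reporter.incrCounter(...) in the old one, and the per-task totals are then summed up at the JobTracker.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class CounterAggregation {
    // Hypothetical application-level counter, as a user would define one.
    enum MyCounter { BAD_RECORDS }

    // Sum per-task counter maps, mimicking how the JobTracker
    // consolidates the counters reported by each TaskTracker.
    static long consolidate(List<Map<MyCounter, Long>> perTask, MyCounter c) {
        long total = 0;
        for (Map<MyCounter, Long> task : perTask) {
            total += task.getOrDefault(c, 0L);
        }
        return total;
    }

    public static void main(String[] args) {
        // Two simulated tasks, each having incremented the counter locally.
        Map<MyCounter, Long> task1 = new EnumMap<>(MyCounter.class);
        task1.put(MyCounter.BAD_RECORDS, 3L);
        Map<MyCounter, Long> task2 = new EnumMap<>(MyCounter.class);
        task2.put(MyCounter.BAD_RECORDS, 2L);

        // Consolidated job-level view: 3 + 2 = 5
        System.out.println(consolidate(List.of(task1, task2), MyCounter.BAD_RECORDS));
    }
}
```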

There is a chance that a rogue job creates millions of counters, and since these counters are stored in the JobTracker, the JobTracker could run out of memory. To avoid such a scenario, the number of counters that can be created per job is limited by the Hadoop framework.
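The enforcement itself amounts to a simple size check when a new counter name is added. Here is a minimal sketch of that idea (plain Java, not the actual Hadoop source; the class and message text are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

public class CounterLimitSketch {
    // Default per-job cap, matching mapreduce.job.counters.limit.
    static final int MAX_COUNTER_LIMIT = 120;

    private final Map<String, Long> counters = new HashMap<>();

    // Reject a new counter name once the cap is reached, which is how
    // the framework protects the JobTracker's memory.
    void increment(String name, long amount) {
        if (!counters.containsKey(name) && counters.size() >= MAX_COUNTER_LIMIT) {
            throw new IllegalStateException("Exceeded limits on number of counters - "
                    + "Counters=" + counters.size() + " Limit=" + MAX_COUNTER_LIMIT);
        }
        counters.merge(name, amount, Long::sum);
    }

    public static void main(String[] args) {
        CounterLimitSketch c = new CounterLimitSketch();
        for (int i = 0; i < MAX_COUNTER_LIMIT; i++) {
            c.increment("counter-" + i, 1);   // fills up to the limit
        }
        try {
            c.increment("one-too-many", 1);   // the 121st distinct counter
            System.out.println("no exception");
        } catch (IllegalStateException e) {
            System.out.println("limit enforced");
        }
    }
}
```

Incrementing an existing counter never trips the check; only creating a new distinct counter name past the cap does.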

The following code is in Counters.java. Note that this code is in the 0.20.203, 0.20.204 and 0.20.205 (now called 1.0) releases. Also note that some of the parameters are configurable and some are not.

/** limit on the size of the name of the group **/
private static final int GROUP_NAME_LIMIT = 128;
/** limit on the size of the counter name **/
private static final int COUNTER_NAME_LIMIT = 64;

private static final JobConf conf = new JobConf();
/** limit on counters **/
public static int MAX_COUNTER_LIMIT =
    conf.getInt("mapreduce.job.counters.limit", 120);

/** the max groups allowed **/
static final int MAX_GROUP_LIMIT = 50;

In trunk and the 0.23 release, the following constants are defined in MRJobConfig.java. Note that all of these parameters are configurable.

public static final String COUNTERS_MAX_KEY = "mapreduce.job.counters.max";
public static final int COUNTERS_MAX_DEFAULT = 120;

public static final String COUNTER_GROUP_NAME_MAX_KEY = "mapreduce.job.counters.group.name.max";
public static final int COUNTER_GROUP_NAME_MAX_DEFAULT = 128;

public static final String COUNTER_NAME_MAX_KEY = "mapreduce.job.counters.counter.name.max";
public static final int COUNTER_NAME_MAX_DEFAULT = 64;

public static final String COUNTER_GROUPS_MAX_KEY = "mapreduce.job.counters.groups.max";
public static final int COUNTER_GROUPS_MAX_DEFAULT = 50;
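In 0.23/trunk these limits can therefore be raised via configuration, for example in mapred-site.xml (the value here is illustrative; raise it with care, and note that in the 1.0 line the limit is read through a static JobConf, so a per-job setting may not take effect, as one of the comments below points out):

```xml
<property>
  <name>mapreduce.job.counters.max</name>
  <value>500</value>
</property>
```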

The above-mentioned configuration parameters are not mentioned in the release documentation, and so I thought it would be worth mentioning them in a blog entry.

I would like to call it a hidden jewel :)

6 comments:

  1. Configurables are good, but raise them beyond the defaults with great care. Configurations are for use-case specific needs, not for raising when you hit an issue and reading around the issue points to a potential workaround.

    You should also consider contributing a doc patch upstream if you noticed it's not documented - because hidden jewels may also bite devs, and you can help prevent that.

    ReplyDelete
    Replies
    1. Harsh - thanks for the feedback - I 100% agree with you on the documentation - the blog is an effort toward the same - regarding updating the Apache docs, inertia is pulling me back, which I am trying to overcome.

      Also, Cloudera developed a tool which picks up the configuration properties added between releases. Although it's not 100% accurate, the tool could be included in the Apache build process.

      Delete
  2. I don't think it's configurable. If you read the Limits.java code, you'll find that it uses a private JobConf to get COUNTERS_MAX. Although you set the property in your Job, it's not passed to Limits.java, which is kind of weird.

    ReplyDelete
  3. I tried setting both counter limits to 500, and I still see the exception.
    Details: http://stackoverflow.com/questions/12140177/more-than-120-counters-in-hadoop/23181414#23181414

    ReplyDelete
  4. This comment has been removed by a blog administrator.

    ReplyDelete