Description
I've been debugging a case where a bunch of reduces were hanging for no apparent reason and then get killed because they did not do anything for 600 seconds. I figured that it's because we are stuck in a very long waiting time due to retry backoffs.
public static int RETRY_BACKOFF[] = { 1, 1, 1, 1, 2, 4, 8, 16, 32, 64 };
That means we wait 10 sec, 10 sec, 10, 10, ... then 640 sec. That's a long time, do we really need that much time to finally be warned that there's a bug in HBase?
Also, the places where we get this:
LOG.debug("reloading table servers because: " + t.getMessage());
should be more verbose. I my logs these are caused by a table not found but the only thing I see is "reloading table servers because: tableName".