Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-27487

Slow meta can create pathological feedback loop with multigets

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.5.1, 2.4.15
    • 2.6.0, 2.4.16, 2.5.3
    • None
    • None

    Description

      This only affects the Table implementation in 2.x releases.

      Call stack

      When Table.batch is called, an AsyncProcessTask is created with SubmittedRows.ALL, which is sent to AsyncProcess.submit(). For the ALL case, this goes to submitAll which creates an AsyncRequestFutureImpl and then calls groupAndSendMultiAction on that.

      When a AsyncRequestFutureImpl is created, a RetryingTimeTracker is created and started as the last step of the constructor.

      In groupAndSendMultiAction, the first thing that has to be done is resolve the HRegionLocation for every action in the batch. This is currently done sequentially, with no timeout on the overall batch completion.

      Once all actions have been resolved, they are passed into sendMultiAction which creates a SingleServerRequestRunnable. Once that runnable is executed, the first thing it does is create a new MultiServerCallable using the same RetryingTimeTracker that was originally created way back.

      That callable extends CancellableRegionServerCallable, and the call method first checks the tracker.getRemainingTime() before actually doing any work. If exceeded, it throws an exception.

      Problem

      If meta is overloaded, or you send any sufficiently large batch of actions, the resolving of HRegionLocations (which happens sequentially) may take a while.

      Depending on the operation timeout configured for the client, that duration may already exceed that timeout before even reaching the CancellableRegionServerCallable.call().

      When the timeout is exceeded there, a DoNotRetryIOException is thrown. This is considered a cache clearing exception, so any locations that may have been slowly resolved earlier up the chain will be thrown away.

      If done with enough concurrency, this can create a feedback loop that is impossible to recover from.

      Potential Solutions

      1. Change the thrown exception type from DoNotRetryIOException to something more appropriate for the actual error (some sort of timeout exception). We'd have to make that exception a "special" exception in ClientExceptionUtil so that it doesn't clear the cache.
      2. Make DoNotRetryIOException itself a "special" exception. The point of clearing cache is to make retries more likely to succeed if the failure was related to a wrong location. But DoNotRetryIOException explicitly is not supposed to be retried, so you might think it shouldn't clear the cache as well. There are many usages of this exception, so it's hard to say for sure that this would be universally safe.
      3. Reset the RetryingTimeTracker after resolving region locations.

      I think I'd lean towards option 1, because it seems odd to say "don't retry in that case". In fact, retrying should be more likely to succeed because locations will have been resolved.

      Whichever we choose, I think we should additionally check the timeout in groupAndSendMultiAction after resolving each region location. We should not allow that process to exceed timeouts and currently it can way more than exceed them before finally being checked incidentally at the end.

      Attachments

        Issue Links

          Activity

            People

              baugenreich Briana Augenreich
              bbeaudreault Bryan Beaudreault
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: