[HBASE-26552] Introduce retry to logroller to avoid abort - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0-alpha-2, 2.4.10
Fix Version/s: 2.5.0, 3.0.0-alpha-3, 2.4.11
Component/s: wal
Labels: None

Release Note:

When retrying a log roll, the total wait is bounded by "hbase.regionserver.logroll.wait.timeout.ms",
and the maximum number of retries is bounded by "hbase.regionserver.logroll.retries".
By default, a failed log roll is not retried.
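
As an illustration of the two settings above, here is a minimal sketch that reads them from a plain java.util.Properties object. The class LogRollRetryConfig and its fields are hypothetical and are not part of HBase. The 30000 ms fallback is a placeholder assumption, since the release note gives no default timeout. The zero-retry default follows from the release note's statement that not retrying is the default.

```java
import java.util.Properties;

// Hypothetical holder for the two settings named in the release note.
final class LogRollRetryConfig {
  static final String WAIT_TIMEOUT_KEY = "hbase.regionserver.logroll.wait.timeout.ms";
  static final String RETRIES_KEY = "hbase.regionserver.logroll.retries";

  // Assumption: placeholder fallback only; the release note states no default value.
  static final long ASSUMED_WAIT_TIMEOUT_MS = 30_000L;
  // From the release note: not retrying is the default, i.e. zero retries.
  static final int DEFAULT_RETRIES = 0;

  final long waitTimeoutMs;
  final int retries;

  LogRollRetryConfig(Properties props) {
    // Fall back to the defaults above when a key is absent.
    this.waitTimeoutMs = Long.parseLong(
      props.getProperty(WAIT_TIMEOUT_KEY, Long.toString(ASSUMED_WAIT_TIMEOUT_MS)));
    this.retries = Integer.parseInt(
      props.getProperty(RETRIES_KEY, Integer.toString(DEFAULT_RETRIES)));
  }
}
```

With an empty Properties object this yields zero retries, so the existing abort-on-first-failure behavior is kept unless an operator opts in.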


Description

When RollController#rollWal is called in AbstractWALRoller, the regionserver aborts if it encounters an exception:

...
} catch (FailedLogCloseException | ConnectException e) {
  abort("Failed log close in log roller", e);
} catch (IOException ex) {
  // Abort if we get here. We probably won't recover an IOE. HBASE-1132
  abort("IOE in log roller",
    ex instanceof RemoteException ? ((RemoteException) ex).unwrapRemoteException() : ex);
} catch (Exception ex) {
  LOG.error("Log rolling failed", ex);
  abort("Log rolling failed", ex);
}

I think we should support retrying rollWal here, so that the service does not have to recover by killing the regionserver. Restarting a regionserver is costly and hurts availability.
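
The retry idea can be sketched roughly as follows. This is not the patch itself: RollRetrySketch, WalRoll, rollWithRetries and the pauseMs parameter are hypothetical names for illustration, and treating the wait timeout as an overall time budget across attempts is an assumed reading of the proposal. When the budget is spent, the last IOException is rethrown so the caller can still abort as the catch block above does today.

```java
import java.io.IOException;

final class RollRetrySketch {
  /** Hypothetical stand-in for the call to RollController#rollWal. */
  interface WalRoll {
    void roll() throws IOException;
  }

  /**
   * Tries the roll, retrying on IOException while both the retry count and
   * the overall time budget allow it. Returns the number of retries used.
   * Rethrows the last IOException once either limit is reached.
   */
  static int rollWithRetries(WalRoll roll, int maxRetries, long waitTimeoutMs, long pauseMs)
      throws IOException, InterruptedException {
    long start = System.nanoTime();
    int retries = 0;
    while (true) {
      try {
        roll.roll();
        return retries;
      } catch (IOException e) {
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        if (retries >= maxRetries || elapsedMs >= waitTimeoutMs) {
          // Out of budget: propagate, so the caller falls back to abort.
          throw e;
        }
        retries++;
        // Assumption: a short fixed pause between attempts.
        Thread.sleep(pauseMs);
      }
    }
  }
}
```

With maxRetries set to 0 the first failure propagates immediately, which matches the current behavior. With maxRetries set to 1, a single transient failure is absorbed.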

I found that FanOutOneBlockAsyncDFSOutputHelper#createOutput, which creates the new writer for the WAL, already supports retrying addBlock through the config "hbase.fs.async.create.retries". Retrying the WAL roll follows the same idea: both make a best effort to get the WAL roll to succeed.

However, initializing the new WAL writer also involves flushing the write buffer and waiting for that flush to complete in AsyncProtobufLogWriter#writeMagicAndWALHeader, which can also fail for hardware reasons. The regionserver is connected to the datanodes once addBlock succeeds, but that does not mean the magic and header can be flushed successfully.

protected long writeMagicAndWALHeader(byte[] magic, WALHeader header) throws IOException {
  return write(future -> {
    output.write(magic);
    try {
      header.writeDelimitedTo(asyncOutputWrapper);
    } catch (IOException e) {
      // should not happen
      throw new AssertionError(e);
    }
    addListener(output.flush(false), (len, error) -> {
      if (error != null) {
        future.completeExceptionally(error);
      } else {
        future.complete(len);
      }
    });
  });
}

In our production clusters we have seen regionservers abort with "IOE in log roller". In our experience, a single retry of rollWal is enough for the WAL roll to complete and for the regionserver to continue serving.

Issue Links

causes: HBASE-26840 Fix NPE in the retry of logroller (Resolved)
relates to: HBASE-26715 Blocked on SyncFuture in AsyncProtobufLogWriter#write (Resolved)
links to: GitHub Pull Request #4038, GitHub Pull Request #4170, GitHub Pull Request #4171

People

Assignee: Xiaolin Ha
Reporter: Xiaolin Ha
Votes: 0
Watchers: 6

Dates

Created: 09/Dec/21 07:18
Updated: 01/Sep/22 10:57
Resolved: 07/Mar/22 08:43