Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-25887

Corrupt wal while region server is aborting.

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.6.0
    • 1.7.0
    • regionserver, wal
    • None
    • Reviewed

    Description

      We have seen a case in our production cluster where we ended up in corrupt wal. WALSplitter logged the below error

      2021-05-12 00:42:46,786 FATAL [XXXX:60020-1] regionserver.HRegionServer - ABORTING region server HOST-B,60020,16207794418
      88: Caught throwable while processing event RS_LOG_REPLAY
      java.lang.NullPointerException
              at org.apache.hadoop.hbase.CellUtil.matchingFamily(CellUtil.java:411)
              at org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEditFamily(WALEdit.java:145)
              at org.apache.hadoop.hbase.regionserver.wal.WALEdit.isMetaEdit(WALEdit.java:150)
              at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:408)
              at org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:261)
              at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:105)
              at org.apache.hadoop.hbase.regionserver.handler.WALSplitterHandler.process(WALSplitterHandler.java:72)
              at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      

      Looking at the raw wal file, we could see that the last WALEdit contains the region id, tablename and sequence number but cells were not persisted.
      Looking at the logs of the RS that generated that corrupt wal file,

      2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL writer in /hbase/WALs/HOST-A,60020,1620774393046
      2021-05-11 23:29:22,196 DEBUG [/HOST-A:60020] ipc.AbstractRpcClient - Stopping rpc client
      2021-05-11 23:29:22,198 INFO  [/HOST-A:60020] regionserver.Leases - regionserver/HOST-A/:60020 closing leases
      2021-05-11 23:29:22,198 INFO  [/HOST-A:60020] regionserver.Leases - regionserver/HOST-A:/HOST-A:60020 closed leases
      2021-05-11 23:29:22,198 WARN  [0020.append-pool8-t1] wal.FSHLog - Append sequenceId=7147823, requesting roll of WAL
      java.nio.channels.ClosedChannelException
              at org.apache.hadoop.hdfs.DataStreamer$LastExceptionInStreamer.throwException4Close(DataStreamer.java:331)
              at org.apache.hadoop.hdfs.DFSOutputStream.checkClosed(DFSOutputStream.java:151)
              at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:105)
              at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
              at java.io.DataOutputStream.write(DataOutputStream.java:107)
              at org.apache.hadoop.hbase.KeyValue.write(KeyValue.java:2543)
              at org.apache.phoenix.hbase.index.wal.KeyValueCodec.write(KeyValueCodec.java:104)
              at org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec$IndexKeyValueEncoder.write(IndexedWALEditCodec.java:218)
              at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogWriter.append(ProtobufLogWriter.java:128)
              at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.append(FSHLog.java:2083)
              at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1941)
              at org.apache.hadoop.hbase.regionserver.wal.FSHLog$RingBufferEventHandler.onEvent(FSHLog.java:1857)
              at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:129)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
      

      These 2 lines are interesting.

      2021-05-11 23:29:22,114 DEBUG [/HOST-A:60020] wal.FSHLog - Closing WAL writer in /hbase/WALs/HOST-A,60020,1620774393046
      ....
      ....
      2021-05-11 23:29:22,198 WARN [0020.append-pool8-t1] wal.FSHLog - Append sequenceId=7147823, requesting roll of WAL
      java.nio.channels.ClosedChannelException

      The append thread encountered java.nio.channels.ClosedChannelException while writing to wal file because the wal file was already closed.

      This is the sequence of shutting down of threads when RS aborts.

            // With disruptor down, this is safe to let go.
            if (this.appendExecutor !=  null) this.appendExecutor.shutdown();
      
            // Tell our listeners that the log is closing
             ...
            if (LOG.isDebugEnabled()) {
              LOG.debug("Closing WAL writer in " + FSUtils.getPath(fullPathLogDir));
            }
            if (this.writer != null) {
              this.writer.close();
              this.writer = null;
            }
          }
      

      1. Shutdown the appendExecutor (but we don't wait for the currently running tasks to complete)
      2. Close the writer. This will in turn close the wal log file.

      We need to wait for appendExecutor to terminate properly and then only close the writer.

      Code has been changed a lot in master branch. Need to study whether this bug exists in master/branch-2

      Attachments

        Issue Links

          Activity

            People

              shahrs87 Rushabh Shah
              shahrs87 Rushabh Shah
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: