HBase / HBASE-26866

Shutdown WAL may abort region server

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.6.0, 3.0.0-alpha-3, 2.4.17, 2.5.4
    • Component/s: wal
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      https://nightlies.apache.org/hbase/HBase-Flaky-Tests/master/3140/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.replication.TestSyncReplicationActive-output.txt

      TestSyncReplicationActive is flaky because we may abort the region server when shutting down the WAL.

      2022-03-18T04:50:37,205 WARN  [RpcServer.default.FPBQ.Fifo.handler=2,queue=0,port=36877] master.MasterRpcServices(682): jenkins-hbase13.apache.org,33377,1647579008859 reported a fatal error:
      ***** ABORTING region server jenkins-hbase13.apache.org,33377,1647579008859: Log rolling failed *****
      Cause:
      java.util.concurrent.RejectedExecutionException: Task org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL$$Lambda$681/1458648270@37209753 rejected from java.util.concurrent.ThreadPoolExecutor@69662eb7[Shutting down, pool size = 1, active threads = 1, queued tasks = 0, completed tasks = 0]
      	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
      	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
      	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
      	at java.util.concurrent.Executors$DelegatedExecutorService.execute(Executors.java:668)
      	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.cleanOldLogs(AbstractFSWAL.java:773)
      	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriterInternal(AbstractFSWAL.java:935)
      	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.lambda$rollWriter$8(AbstractFSWAL.java:953)
      	at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:196)
      	at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.rollWriter(AbstractFSWAL.java:953)
      	at org.apache.hadoop.hbase.wal.AbstractWALRoller$RollController.rollWal(AbstractWALRoller.java:316)
      	at org.apache.hadoop.hbase.wal.AbstractWALRoller.run(AbstractWALRoller.java:214)
      

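      The problem is that WAL removal is asynchronous: cleanOldLogs submits the removal task to a thread pool. When shutting down the WAL we close that thread pool, so a log roll that is still in progress gets a RejectedExecutionException when it tries to submit the task, and this causes the region server to abort.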
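
      The failure mode can be reproduced outside HBase with a plain single-thread executor. The sketch below is illustrative only: the class WalShutdownSketch, the field cleanupPool, and both cleanOldLogs* methods are hypothetical names, not HBase code, and the guarded variant is one possible way to tolerate the race, not necessarily the change that was committed for this issue.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a WAL that removes old log files asynchronously.
public class WalShutdownSketch {
  // Assumed: a single-thread pool, mirroring the "pool size = 1" in the log above.
  private final ExecutorService cleanupPool = Executors.newSingleThreadExecutor();

  // Unguarded submit: once the pool is shut down, execute() throws
  // RejectedExecutionException, which propagates to the caller (the log roller).
  public void cleanOldLogsUnguarded(Runnable removalTask) {
    cleanupPool.execute(removalTask);
  }

  // Guarded submit: a rejection during shutdown is treated as "skip cleanup"
  // rather than as a fatal error. Returns true if the task was accepted.
  public boolean cleanOldLogsGuarded(Runnable removalTask) {
    try {
      cleanupPool.execute(removalTask);
      return true;
    } catch (RejectedExecutionException e) {
      // The pool is shutting down, so there is nothing left to clean up here.
      return false;
    }
  }

  // Closes the pool, as happens when the WAL is shut down.
  public void shutdown() {
    cleanupPool.shutdown();
    try {
      cleanupPool.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
    }
  }

  public static void main(String[] args) {
    WalShutdownSketch wal = new WalShutdownSketch();
    wal.shutdown();

    // A "log roll" that arrives after shutdown has started.
    try {
      wal.cleanOldLogsUnguarded(() -> {});
      System.out.println("unguarded submit accepted");
    } catch (RejectedExecutionException e) {
      System.out.println("unguarded submit rejected");
    }

    System.out.println("guarded submit accepted: " + wal.cleanOldLogsGuarded(() -> {}));
  }
}
```

      In the unguarded path the exception reaches the caller, which corresponds to the "Log rolling failed" abort in the trace above. Whether swallowing the rejection is acceptable depends on the caller being able to tell that a shutdown is in progress; that check is omitted here.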

            People

              Assignee: Duo Zhang (zhangduo)
              Reporter: Duo Zhang (zhangduo)