HBASE-28113

Modify the way of acquiring the RegionStateNode lock in checkOnlineRegionsReport to tryLock

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
    • Component/s: master
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      HBase Cluster description: 1 master and 5 region servers

      The issue was triggered while running ITBLL (IntegrationTestBigLinkedList), when ChaosMonkey performed RestartRandomRsAction.

      The steps for the RestartRandomRsAction operation are as follows.

      1. Stop node-3, node-2, and node-4.
      2. Stop node-5, which hosts the meta region.
      3. Start node-3.
      4. Stop node-1.
      5. Start node-2, node-4, node-5, and node-1.

      Fault description:

      1. The RegionServer nodes, including node-2, node-4, node-5, and node-1, are unable to come online.

      Observing the RegionServer logs, the reportForDuty operation consistently times out. The log is as follows:

      2023-09-21T08:05:30,251 INFO  [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
      2023-09-21T08:05:43,581 INFO  [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
      2023-09-21T08:05:59,591 INFO  [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
      2023-09-21T08:06:21,601 INFO  [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
      2023-09-21T08:06:55,611 INFO  [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
      2023-09-21T08:07:53,620 INFO  [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
      2023-09-21T08:09:39,631 INFO  [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
      2023-09-21T08:13:01,642 INFO  [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874 
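
The spacing of the attempts in the log above can be made explicit with a little arithmetic. The following is a minimal sketch that parses the eight log timestamps and prints the gap between consecutive reportForDuty attempts; the class and method names (ReportForDutyGaps, gapsMillis) are hypothetical and not part of HBase.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of HBase: measures gaps between log timestamps.
class ReportForDutyGaps {
  // Matches the timestamp layout used in the log excerpt (comma before millis).
  private static final DateTimeFormatter LOG_TS =
      DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss,SSS");

  // Returns the gap in milliseconds between each pair of consecutive timestamps.
  static List<Long> gapsMillis(String[] timestamps) {
    List<Long> gaps = new ArrayList<>();
    for (int i = 1; i < timestamps.length; i++) {
      LocalDateTime prev = LocalDateTime.parse(timestamps[i - 1], LOG_TS);
      LocalDateTime cur = LocalDateTime.parse(timestamps[i], LOG_TS);
      gaps.add(Duration.between(prev, cur).toMillis());
    }
    return gaps;
  }

  public static void main(String[] args) {
    // Timestamps copied from the reportForDuty log lines.
    String[] ts = {
      "2023-09-21T08:05:30,251", "2023-09-21T08:05:43,581",
      "2023-09-21T08:05:59,591", "2023-09-21T08:06:21,601",
      "2023-09-21T08:06:55,611", "2023-09-21T08:07:53,620",
      "2023-09-21T08:09:39,631", "2023-09-21T08:13:01,642"
    };
    // Prints [13330, 16010, 22010, 34010, 58009, 106011, 202011]
    System.out.println(gapsMillis(ts));
  }
}
```

From the second gap onward, subtracting roughly 10 s leaves 6, 12, 24, 48, 96, and 192 s, so the remainder doubles on each attempt. That pattern is consistent with a fixed per-attempt timeout plus an exponentially growing wait, but this reading is inferred from the timestamps alone and is not stated in the report.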

      2. Master threads are blocked.

      • Both RpcServer.priority.RWQ.Fifo.write.handler threads are blocked on RegionStateNode.lock.
      "RpcServer.priority.RWQ.Fifo.write.handler=1,queue=0,port=16000" #67 daemon prio=5 os_prio=0 tid=0x00007f6ae3caf800 nid=0xea405 waiting on condition [0x00007f6aa1dcd000]
         java.lang.Thread.State: WAITING (parking)
          at sun.misc.Unsafe.park(Native Method)
          - parking to wait for  <0x00000004e3c8e6f0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
          at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
          at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
          at org.apache.hadoop.hbase.master.assignment.RegionStateNode.lock(RegionStateNode.java:323)
          at org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1401)
          at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:1363)
          at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:639)
          at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:17395)
          at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:437)
          at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
          at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
          at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82) 
      • 20 PEWorker threads are blocked on RegionStateStore.updateRegionLocation.
      "PEWorker-1" #133 daemon prio=5 os_prio=0 tid=0x00007f6acdcf9800 nid=0xea5bc waiting on condition [0x00007f6a9d799000]
         java.lang.Thread.State: WAITING (parking)
          at sun.misc.Unsafe.park(Native Method)
          - parking to wait for  <0x00000004e4cc8e58> (a java.util.concurrent.CompletableFuture$Signaller)
          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
          at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
          at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
          at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
          at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
          at org.apache.hadoop.hbase.client.TableOverAsyncTable.put(TableOverAsyncTable.java:213)
          at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:259)
          at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:224)
          at org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionClosedAbnormally(AssignmentManager.java:2076)
          at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:305)
          at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:57)
          at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown Source)
          at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989) 
      • All four KeepAlivePEWorker threads are blocked.

      KeepAlivePEWorker-17, -18, and -19 are blocked on RegionStateStore.updateRegionLocation.

      "KeepAlivePEWorker-17" #381 daemon prio=5 os_prio=0 tid=0x000056260b75d000 nid=0xeffb0 waiting on condition [0x00007f6a94339000]
         java.lang.Thread.State: WAITING (parking)
          at sun.misc.Unsafe.park(Native Method)
          - parking to wait for  <0x00000004ebf83440> (a java.util.concurrent.CompletableFuture$Signaller)
          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
          at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
          at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
          at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
          at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
          at org.apache.hadoop.hbase.client.TableOverAsyncTable.put(TableOverAsyncTable.java:213)
          at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:259)
          at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:224)
          at org.apache.hadoop.hbase.master.assignment.AssignmentManager.transitStateAndUpdate(AssignmentManager.java:1982)
          at org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionOpening(AssignmentManager.java:1997)
          at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.openRegion(TransitRegionStateProcedure.java:279)
          at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:434)
          at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:111)
          at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
          at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:398)
          at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:111)
          at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown Source)
          at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989) 
      • KeepAlivePEWorker-20 is blocked on RegionStateNode.lock.
      "KeepAlivePEWorker-20" #388 daemon prio=5 os_prio=0 tid=0x000056260b847800 nid=0xf02da waiting on condition [0x00007f6a92e25000]
         java.lang.Thread.State: WAITING (parking)
          at sun.misc.Unsafe.park(Native Method)
          - parking to wait for  <0x00000004e3c8d990> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
          at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
          at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
          at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
          at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
          at org.apache.hadoop.hbase.master.assignment.RegionStateNode.lock(RegionStateNode.java:323)
          at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:551)
          at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:243)
          at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:68)
          at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
          at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown Source)
          at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
          at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989) 
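
The thread dumps above show RPC handlers parked in ReentrantLock.lock() while other threads hold RegionStateNode locks and wait on the meta update. The change named in the issue title, switching the report path to tryLock, can be illustrated with the plain java.util.concurrent sketch below. It is not the HBase patch: RegionNode, checkReport, and demo are hypothetical names, and skipping a busy region until a later report is an assumption about the intended behaviour.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, not HBase code: a report handler that never parks on a busy lock.
class TryLockReportSketch {
  // Stand-in for a per-region state node guarded by its own lock.
  static final class RegionNode {
    final String name;
    final ReentrantLock lock = new ReentrantLock();
    RegionNode(String name) { this.name = name; }
  }

  // Checks each reported region; a region whose lock is held elsewhere is
  // skipped instead of blocking the calling handler thread.
  static List<String> checkReport(List<RegionNode> nodes) {
    List<String> skipped = new ArrayList<>();
    for (RegionNode node : nodes) {
      if (!node.lock.tryLock()) {
        skipped.add(node.name); // assumption: retried on a later report
        continue;
      }
      try {
        // Placeholder for the actual state check of this region.
      } finally {
        node.lock.unlock();
      }
    }
    return skipped;
  }

  // Holds one region's lock on a worker thread, then runs the report check.
  static List<String> demo() {
    RegionNode busy = new RegionNode("region-busy");
    RegionNode idle = new RegionNode("region-idle");
    CountDownLatch held = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);
    Thread worker = new Thread(() -> {
      busy.lock.lock();
      try {
        held.countDown();
        release.await(); // simulates a procedure stuck on a slow update
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        busy.lock.unlock();
      }
    });
    worker.start();
    try {
      held.await();
      return checkReport(Arrays.asList(busy, idle));
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    } finally {
      release.countDown();
    }
  }

  public static void main(String[] args) {
    System.out.println("skipped: " + demo()); // prints "skipped: [region-busy]"
  }
}
```

With lock() in place of tryLock(), the call to checkReport would park until the worker released the lock, which is the state the write handlers are in at AssignmentManager.checkOnlineRegionsReport in the dump.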

      Attachments

        1. master.stack
          174 kB
          Haiping lv

            People

              luoen Haiping lv
              luoen Haiping lv