Details
Improvement
Status: Resolved
Major
Resolution: Fixed
Reviewed
Description
HBase cluster: 1 master and 5 region servers.
While ITBLL (IntegrationTestBigLinkedList) is running, the issue is triggered when ChaosMonkey performs RestartRandomRsAction.
The RestartRandomRsAction operation proceeds as follows:
- Stop node-3, node-2, and node-4.
- Then stop node-5, which hosts the meta region.
- Start node-3.
- Then stop node-1.
- Start node-2, node-4, node-5, and node-1.
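The sequence above can be replayed as a small sketch to see which region servers are expected to be up after each step. This is only an illustration: the `replay` helper and the `STEPS` list are hypothetical names, and the sketch assumes node-1 through node-5 are the five region servers. "start" here means the process was launched, not that it managed to register with the master.

```python
# Illustrative replay of the chaos steps listed above (not ChaosMonkey code).
# Assumption: node-1 .. node-5 are the five region servers of the cluster.
ALL_NODES = ["node-1", "node-2", "node-3", "node-4", "node-5"]

STEPS = [
    ("stop", ["node-3", "node-2", "node-4"]),
    ("stop", ["node-5"]),            # node-5 hosts the meta region
    ("start", ["node-3"]),
    ("stop", ["node-1"]),
    ("start", ["node-2", "node-4", "node-5", "node-1"]),
]


def replay(steps, all_nodes):
    """Return the sorted list of launched nodes after each step."""
    launched = set(all_nodes)
    history = []
    for action, nodes in steps:
        if action == "stop":
            launched -= set(nodes)
        elif action == "start":
            launched |= set(nodes)
        else:
            raise ValueError(f"unknown action: {action}")
        history.append(sorted(launched))
    return history


if __name__ == "__main__":
    for (action, nodes), up in zip(STEPS, replay(STEPS, ALL_NODES)):
        print(f"{action:5} {', '.join(nodes):32} -> launched: {', '.join(up)}")
```

After the fourth step only node-3 is left running, and the server that hosted meta (node-5) has already been stopped. That is the state of the cluster when the last step tries to bring the other four servers back.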
Fault description:
1. The RegionServer nodes, including node-2, node-4, node-5, and node-1, are unable to come online.
The RegionServer logs show that the reportForDuty operation consistently times out:
2023-09-21T08:05:30,251 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
2023-09-21T08:05:43,581 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
2023-09-21T08:05:59,591 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
2023-09-21T08:06:21,601 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
2023-09-21T08:06:55,611 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
2023-09-21T08:07:53,620 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
2023-09-21T08:09:39,631 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
2023-09-21T08:13:01,642 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
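To make the retry pattern in this log easier to see, the following sketch computes the gap between consecutive reportForDuty attempts from the timestamps above. The helper name `gaps_in_seconds` is hypothetical, and the sketch only does arithmetic on the logged times; it does not claim to reproduce the RegionServer's actual retry policy.

```python
from datetime import datetime

# Timestamps copied from the reportForDuty log lines above.
ATTEMPTS = [
    "2023-09-21T08:05:30,251",
    "2023-09-21T08:05:43,581",
    "2023-09-21T08:05:59,591",
    "2023-09-21T08:06:21,601",
    "2023-09-21T08:06:55,611",
    "2023-09-21T08:07:53,620",
    "2023-09-21T08:09:39,631",
    "2023-09-21T08:13:01,642",
]


def gaps_in_seconds(stamps):
    """Return the whole-second gap between each pair of consecutive attempts."""
    times = [datetime.strptime(s, "%Y-%m-%dT%H:%M:%S,%f") for s in stamps]
    return [round((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    gaps = gaps_in_seconds(ATTEMPTS)
    print("gaps between attempts (s):", gaps)
    # How much longer each wait is than the previous one.
    print("growth of each gap (s):   ", [b - a for a, b in zip(gaps, gaps[1:])])
```

The gaps come out as 13, 16, 22, 34, 58, 106, and 202 seconds, and each gap exceeds the previous one by 3, 6, 12, 24, 48, and 96 seconds. The waits therefore roughly double in their variable part, which is consistent with a retry backoff on top of a fixed per-attempt wait. The report itself does not state the backoff settings.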
2. The master threads are blocked.
- Both RpcServer.priority.RWQ.Fifo.write.handler threads are blocked on RegionStateNode.lock.
"RpcServer.priority.RWQ.Fifo.write.handler=1,queue=0,port=16000" #67 daemon prio=5 os_prio=0 tid=0x00007f6ae3caf800 nid=0xea405 waiting on condition [0x00007f6aa1dcd000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000004e3c8e6f0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at org.apache.hadoop.hbase.master.assignment.RegionStateNode.lock(RegionStateNode.java:323)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1401)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:1363)
    at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:639)
    at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:17395)
    at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:437)
    at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
    at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
    at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
- 20 PEWorker threads are blocked on RegionStateStore.updateRegionLocation.
"PEWorker-1" #133 daemon prio=5 os_prio=0 tid=0x00007f6acdcf9800 nid=0xea5bc waiting on condition [0x00007f6a9d799000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000004e4cc8e58> (a java.util.concurrent.CompletableFuture$Signaller)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
    at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
    at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
    at org.apache.hadoop.hbase.client.TableOverAsyncTable.put(TableOverAsyncTable.java:213)
    at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:259)
    at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:224)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionClosedAbnormally(AssignmentManager.java:2076)
    at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:305)
    at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:57)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown Source)
    at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
- All four KeepAlivePEWorker threads are blocked.
- KeepAlivePEWorker-17, KeepAlivePEWorker-18, and KeepAlivePEWorker-19 are blocked on RegionStateStore.updateRegionLocation.
"KeepAlivePEWorker-17" #381 daemon prio=5 os_prio=0 tid=0x000056260b75d000 nid=0xeffb0 waiting on condition [0x00007f6a94339000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000004ebf83440> (a java.util.concurrent.CompletableFuture$Signaller)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
    at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
    at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
    at org.apache.hadoop.hbase.client.TableOverAsyncTable.put(TableOverAsyncTable.java:213)
    at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:259)
    at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:224)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.transitStateAndUpdate(AssignmentManager.java:1982)
    at org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionOpening(AssignmentManager.java:1997)
    at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.openRegion(TransitRegionStateProcedure.java:279)
    at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:434)
    at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:111)
    at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
    at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:398)
    at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:111)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown Source)
    at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
- KeepAlivePEWorker-20 is blocked on RegionStateNode.lock.
"KeepAlivePEWorker-20" #388 daemon prio=5 os_prio=0 tid=0x000056260b847800 nid=0xf02da waiting on condition [0x00007f6a92e25000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for <0x00000004e3c8d990> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
    at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
    at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
    at org.apache.hadoop.hbase.master.assignment.RegionStateNode.lock(RegionStateNode.java:323)
    at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:551)
    at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:243)
    at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:68)
    at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown Source)
    at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
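The report does not spell out how the blocked master threads relate to the reportForDuty timeouts. One plausible reading, offered here only as an assumption, is handler starvation: if every handler in a small fixed pool is parked on a lock whose holder is itself waiting on something else, any later request queued behind them cannot be served before its timeout. The sketch below models that pattern in plain Python. It is not HBase code, and `simulate`, `region_server_report`, and `report_for_duty` are hypothetical stand-ins.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def simulate(handler_count=2, rpc_timeout=0.2):
    """Model a fixed handler pool whose threads all park on one held lock.

    Returns (timed_out, result_after_release) for a request queued behind them.
    """
    # Stand-in for RegionStateNode.lock, held by a worker that is itself
    # stuck (in the report: waiting on a meta update). Assumption only.
    region_lock = threading.Lock()
    region_lock.acquire()

    pool = ThreadPoolExecutor(max_workers=handler_count)

    def region_server_report():
        # Needs the region lock, so it parks while the lock is held.
        with region_lock:
            return "report handled"

    def report_for_duty():
        # Needs no lock at all, but still needs a free handler thread.
        return "registered"

    # Occupy every handler with a call that blocks on the lock.
    reports = [pool.submit(region_server_report) for _ in range(handler_count)]
    startup = pool.submit(report_for_duty)  # queued behind the blocked calls

    try:
        startup.result(timeout=rpc_timeout)
        timed_out = False
    except FutureTimeout:
        timed_out = True

    # Once the lock holder makes progress, the pool drains normally.
    region_lock.release()
    result_after_release = startup.result(timeout=5)
    for report in reports:
        report.result(timeout=5)
    pool.shutdown()
    return timed_out, result_after_release


if __name__ == "__main__":
    print(simulate())
```

In this model the queued request times out even though it never touches the lock, and it succeeds as soon as the lock holder is unblocked. Whether the real master behaves this way depends on how its RPC queues route the two calls, which the thread dumps above do not show.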