HBASE-23984: [Flakey Tests] TestMasterAbortAndRSGotKilled fails in teardown


Details

    • Type: Test
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 2.4.0
    • Fix Version/s: 3.0.0-alpha-1, 2.3.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      It's failing with decent frequency of late during cluster shutdown. Seems basic. There is an unassign/move going on. The test just checks that the Master can come back up after being killed; it does not check that the move is done. If, on subsequent cluster shutdown, the move can't report to the Master because it is shutting down, then the move fails, we abort the server, and then we get a wonky loop where we can't close the region because the server is aborting.
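      For illustration only, here is one way the teardown could sidestep that race: wait for in-flight master procedures (the move/unassign) to finish before shutting the cluster down. This is a hypothetical guard, not part of the actual test (the real fix, below, is in the accounting); the class and method names here are made up.

      import org.apache.hadoop.hbase.HBaseTestingUtility;
      import org.apache.hadoop.hbase.master.procedure.MasterProcedureEnv;
      import org.apache.hadoop.hbase.procedure2.ProcedureExecutor;
      import org.apache.hadoop.hbase.procedure2.ProcedureTestingUtility;

      public class TeardownGuard {
        // Hypothetical teardown guard: block until no master procedure is
        // running, so the in-flight unassign/move has a live Master to report
        // its close to before we pull the cluster down.
        static void awaitQuiescence(HBaseTestingUtility util) {
          ProcedureExecutor<MasterProcedureEnv> procExec =
              util.getMiniHBaseCluster().getMaster().getMasterProcedureExecutor();
          ProcedureTestingUtility.waitNoProcedureRunning(procExec);
        }
      }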

      At the root, there is a misaccounting when the unassign close fails: we don't clean up references in the regionserver's local RIT accounting. Deeper than this, the close code is duplicated in three places that I can see: in RegionServer, in CloseRegionHandler, and in UnassignRegionHandler.

      Let me fix this issue and the code dupe.
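      For a sense of the shape (a sketch under assumptions, not the committed patch; doCloseRegion is an illustrative name): pull the common close sequence into one routine on the regionserver that both handlers delegate to, so the local RIT bookkeeping has exactly one owner.

      // Sketch only. One shared close routine that HRegionServer,
      // CloseRegionHandler, and UnassignRegionHandler could all call instead
      // of each carrying its own copy of this sequence.
      void doCloseRegion(HRegion region, boolean abort) throws IOException {
        byte[] encodedNameBytes =
            Bytes.toBytes(region.getRegionInfo().getEncodedName());
        try {
          if (region.close(abort) == null) {
            // Another actor interrupted the close; let it finish the bookkeeping.
            return;
          }
          removeRegion(region, null);
        } finally {
          // A single owner for the local RIT entry: cleared on success and on
          // failure, so a failed close can no longer strand a stale record.
          getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);
        }
      }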

      Details:

      From https://builds.apache.org/job/HBase-Flaky-Tests/job/branch-2/5733/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestMasterAbortAndRSGotKilled-output.txt

      Here is the unassign handler failing because the Master went down earlier (it's probably trying to talk to the old Master location):

      ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: Failed to close region ede67f9f661acc1241faf468b081d548 and can not recover *****
      Cause:
      java.io.IOException: Failed to report close to master: ede67f9f661acc1241faf468b081d548
      	at org.apache.hadoop.hbase.regionserver.handler.UnassignRegionHandler.process(UnassignRegionHandler.java:125)
      	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      ... then the cluster shutdown tries to close the same Region... but fails because we are aborting because of the above....

      2020-03-12 08:11:16,600 ERROR [RS_CLOSE_REGION-regionserver/asf905:0-0] helpers.MarkerIgnoringBase(159): ***** ABORTING region server asf905.gq1.ygridcore.net,32989,1584000644108: Unrecoverable exception while closing region hbase:namespace,,1584000652744.78f4ae5beda711a9bebad0b6b8376cc9., still finishing close *****
      java.io.IOException: Aborting flush because server is aborted...
      	at org.apache.hadoop.hbase.regionserver.HRegion.internalPrepareFlushCache(HRegion.java:2545)
      	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2530)
      	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2504)
      	at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2495)
      	at org.apache.hadoop.hbase.regionserver.HRegion.doClose(HRegion.java:1650)
      	at org.apache.hadoop.hbase.regionserver.HRegion.close(HRegion.java:1552)
      	at org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:110)
      	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      	at java.lang.Thread.run(Thread.java:748)
      

      ....

      And the RS keeps looping, trying to close the Region, even though we've aborted and there is handling in the RS close-Regions path to deal with abort.

      The trouble seems to be that when UnassignRegionHandler fails its region close, it does not unregister the Region with rs.getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);
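      A minimal sketch of that failure path with the cleanup added; only the remove(...) call above is quoted from the code, the surrounding structure is assumed:

      // Sketch (assumed structure) of the report-close step in
      // UnassignRegionHandler.process. closedContext stands in for the CLOSED
      // transition for this region; its construction is elided here.
      if (!rs.reportRegionStateTransition(closedContext)) {
        // The missing step: clear the local RIT entry before throwing (which
        // leads to the abort), so the shutdown-time CloseRegionHandler is not
        // stuck behind a stale record.
        rs.getRegionsInTransitionInRS().remove(encodedNameBytes, Boolean.FALSE);
        throw new IOException("Failed to report close to master: " + encodedName);
      }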

People

    • Assignee: stack (Michael Stack)
    • Reporter: stack (Michael Stack)