HBASE-27711: Regions permanently stuck in unknown_server state


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.11
    • Fix Version/s: None
    • Component/s: Region Assignment
    • Labels: None
    • Environment: HBase: 2.4.11
      Hadoop: 3.2.4
      ZooKeeper: 3.7.1

    Description

      We see the following log message, and the regions it lists are never put back into service without manual intervention:

      NodeC hbasemaster-0 hbasemaster 2023-02-15 14:15:56,149 WARN  [master/NodeC:16000.Chore.1] janitor.CatalogJanitor: unknown_server=NodeA,16201,1676468874221/__test-table_NodeA__,,1672786676251.a3cac9159205d7611c85dd5c4feeded7., unknown_server=NodeA,16201,1676468874221/__test-table_NodeB__,,1672786676579.50e948f0a5bc962aabfe27e9ea4227a5., unknown_server=NodeA,16201,1676468874221/aeris_v2,,1672786736251.6ab0292cca294784bce8415cc69c30d4., unknown_server=NodeA,16201,1676468874221/aeris_v2,\x06,1672786736251.15d958805892370907a47f31a6e08db1., unknown_server=NodeA,16201,1676468874221/aeris_v2,\x12,1672786736251.ac3c78ff6903f52d9e2acf80b8436085.

      Normally when we see these unknown_server logs, they are resolved by the regions being reassigned; however, we have a reproducible case where this does not happen.
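
      For completeness, below is a minimal sketch of reassigning one of the affected regions by hand through the Java Admin API; the ZooKeeper quorum value and the use of the encoded region name from the log above are assumptions about the environment, not something taken from our actual procedure.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.Admin;
      import org.apache.hadoop.hbase.client.Connection;
      import org.apache.hadoop.hbase.client.ConnectionFactory;
      import org.apache.hadoop.hbase.util.Bytes;

      public class ReassignStuckRegion {
        public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          // Assumption: adjust to the cluster's ZooKeeper quorum.
          conf.set("hbase.zookeeper.quorum", "NodeA,NodeB,NodeC");

          try (Connection conn = ConnectionFactory.createConnection(conf);
               Admin admin = conn.getAdmin()) {
            // Encoded region name copied from the unknown_server log line above.
            // Admin.assign() asks the master to schedule an assignment for the region.
            admin.assign(Bytes.toBytes("6ab0292cca294784bce8415cc69c30d4"));
          }
        }
      }

      The hbck2 "assigns" command with the same encoded region name is the usual operational alternative.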

      When this occurs we also see the following log messages related to the regions:

      NodeC hbasemaster-0 hbasemaster 2023-02-15 14:10:59,810 WARN  [RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000] assignment.AssignmentManager: Reporting NodeC,16201,1676469549542 server does not match state=OPEN, location=NodeA,16201,1676468874221, table=aeris_v2, region=6ab0292cca294784bce8415cc69c30d4 (time since last update=3749ms); closing…
      NodeC hbasemaster-0 hbasemaster 2023-02-15 14:11:00,323 WARN  [RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000] assignment.AssignmentManager: No matching procedure found for NodeC,16201,1676469549542 transition on state=OPEN, location=NodeA,16201,1676468874221, table=aeris_v2, region=6ab0292cca294784bce8415cc69c30d4 to CLOSED
      

      This suggests that the master has a different mapping of region to region server than the reporting region server, so it closes the region. We would expect the regions to then be assigned somewhere else and reopened, but we are not seeing that.

      This log message comes from here: https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java#L1292

      The next thing that happens is a call to AssignmentManager's closeRegionSilently method.
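
      To make the suspected sequence concrete, here is a simplified stand-alone model of the flow described above. It is only a sketch: the class, field and method names are invented for illustration and this is not the actual AssignmentManager code. The point it illustrates is that the mismatching region is closed on the reporting server without any reassignment being scheduled, so if the recorded location is a server that no longer exists, nothing reopens the region.

      import java.util.HashMap;
      import java.util.Map;

      public class SilentCloseSketch {
        // Master's view: encoded region name -> server the master believes is hosting it.
        private final Map<String, String> recordedLocation = new HashMap<>();

        // Invoked when a region server reports a region it is actually serving.
        void reportOnlineRegion(String reportingServer, String encodedRegionName) {
          String expected = recordedLocation.get(encodedRegionName);
          if (expected != null && !expected.equals(reportingServer)) {
            System.out.printf(
              "Reporting %s server does not match location=%s, region=%s; closing...%n",
              reportingServer, expected, encodedRegionName);
            // Only a close is issued to the reporting server; no reassignment is
            // scheduled and the recorded location still points at the old server.
            // If that server is gone (unknown_server), the region stays offline.
            closeRegionSilently(reportingServer, encodedRegionName);
          }
        }

        void closeRegionSilently(String server, String encodedRegionName) {
          System.out.printf("Silently closing %s on %s%n", encodedRegionName, server);
        }

        public static void main(String[] args) {
          SilentCloseSketch master = new SilentCloseSketch();
          // The master still records the region on the old NodeA instance...
          master.recordedLocation.put("6ab0292cca294784bce8415cc69c30d4",
            "NodeA,16201,1676468874221");
          // ...but a different server instance reports it, triggering the silent close.
          master.reportOnlineRegion("NodeC,16201,1676469549542",
            "6ab0292cca294784bce8415cc69c30d4");
        }
      }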

      Our setup:

      We have a three-server cluster that runs a full HBase stack: 3 ZooKeeper nodes, an active and a standby HBase master, 3 region servers, and 3 HDFS data nodes. For reliability testing we run a script that restarts one of the three servers; each server runs a region server, a ZooKeeper process, and an HDFS process, and may also be running the active or standby HBase master.

      In this test we saw the issue after NodeB, which had been running the active master, was killed at 14:08:33, so the master switched over to NodeC. Then at 14:12:56 we saw a "STUCK Region-In-Transition" log for a region on NodeA (another common, reproducible issue for which we plan to open a separate ticket), and we restarted just the region server process on NodeA to get that region reassigned.

      Attachments

        1. config.txt
          6 kB
          Aaron Beitch

        Activity

          People

            Assignee: Unassigned
            Reporter: Aaron Beitch (aaronb)
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated: