HBASE-27711: Regions permanently stuck in unknown_server state


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.4.11
    • Fix Version/s: None
    • Component/s: Region Assignment
    • Labels: None
    • Environment: HBase: 2.4.11
      Hadoop: 3.2.4
      ZooKeeper: 3.7.1

    Description

      We see the following log message, and the regions it lists are never put back into service without manual intervention:

      NodeC hbasemaster-0 hbasemaster 2023-02-15 14:15:56,149 WARN  [master/NodeC:16000.Chore.1] janitor.CatalogJanitor: unknown_server=NodeA,16201,1676468874221/__test-table_NodeA__,,1672786676251.a3cac9159205d7611c85dd5c4feeded7., unknown_server=NodeA,16201,1676468874221/__test-table_NodeB__,,1672786676579.50e948f0a5bc962aabfe27e9ea4227a5., unknown_server=NodeA,16201,1676468874221/aeris_v2,,1672786736251.6ab0292cca294784bce8415cc69c30d4., unknown_server=NodeA,16201,1676468874221/aeris_v2,\x06,1672786736251.15d958805892370907a47f31a6e08db1., unknown_server=NodeA,16201,1676468874221/aeris_v2,\x12,1672786736251.ac3c78ff6903f52d9e2acf80b8436085.

      Normally when we see these unknown_server logs, they are resolved by the regions being reassigned; however, we have a reproducible case where this does not happen.
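
      For completeness, below is a minimal sketch of reassigning one of the affected regions by hand through the Java Admin API; the ZooKeeper quorum value and the use of the encoded region name from the log above are assumptions about the environment, not something taken from our actual procedure.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.Admin;
      import org.apache.hadoop.hbase.client.Connection;
      import org.apache.hadoop.hbase.client.ConnectionFactory;
      import org.apache.hadoop.hbase.util.Bytes;

      public class ReassignStuckRegion {
        public static void main(String[] args) throws Exception {
          Configuration conf = HBaseConfiguration.create();
          // Assumption: adjust to the cluster's ZooKeeper quorum.
          conf.set("hbase.zookeeper.quorum", "NodeA,NodeB,NodeC");

          try (Connection conn = ConnectionFactory.createConnection(conf);
               Admin admin = conn.getAdmin()) {
            // Encoded region name copied from the unknown_server log line above.
            // Admin.assign() asks the master to schedule an assignment for the region.
            admin.assign(Bytes.toBytes("6ab0292cca294784bce8415cc69c30d4"));
          }
        }
      }

      The hbck2 "assigns" command with the same encoded region name is the usual operational alternative.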

      When this occurs we also see the following log messages related to the regions:

      NodeC hbasemaster-0 hbasemaster 2023-02-15 14:10:59,810 WARN  [RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000] assignment.AssignmentManager: Reporting NodeC,16201,1676469549542 server does not match state=OPEN, location=NodeA,16201,1676468874221, table=aeris_v2, region=6ab0292cca294784bce8415cc69c30d4 (time since last update=3749ms); closing…
      NodeC hbasemaster-0 hbasemaster 2023-02-15 14:11:00,323 WARN  [RpcServer.priority.RWQ.Fifo.write.handler=0,queue=0,port=16000] assignment.AssignmentManager: No matching procedure found for NodeC,16201,1676469549542 transition on state=OPEN, location=NodeA,16201,1676468874221, table=aeris_v2, region=6ab0292cca294784bce8415cc69c30d4 to CLOSED
      

      This suggests that the master has a different mapping of region to region server than the reporting region server, so it closes the region. We would expect the regions to then be assigned somewhere else and reopened, but we are not seeing that.

      This log message comes from here: https://github.com/apache/hbase/blob/branch-2.4/hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/AssignmentManager.java#L1292

      The next thing that happens is a call to AssignmentManager's closeRegionSilently method.
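
      To make the suspected sequence concrete, here is a simplified stand-alone model of the flow described above. It is only a sketch: the class, field and method names are invented for illustration and this is not the actual AssignmentManager code. The point it illustrates is that the mismatching region is closed on the reporting server without any reassignment being scheduled, so if the recorded location is a server that no longer exists, nothing reopens the region.

      import java.util.HashMap;
      import java.util.Map;

      public class SilentCloseSketch {
        // Master's view: encoded region name -> server the master believes is hosting it.
        private final Map<String, String> recordedLocation = new HashMap<>();

        // Invoked when a region server reports a region it is actually serving.
        void reportOnlineRegion(String reportingServer, String encodedRegionName) {
          String expected = recordedLocation.get(encodedRegionName);
          if (expected != null && !expected.equals(reportingServer)) {
            System.out.printf(
              "Reporting %s server does not match location=%s, region=%s; closing...%n",
              reportingServer, expected, encodedRegionName);
            // Only a close is issued to the reporting server; no reassignment is
            // scheduled and the recorded location still points at the old server.
            // If that server is gone (unknown_server), the region stays offline.
            closeRegionSilently(reportingServer, encodedRegionName);
          }
        }

        void closeRegionSilently(String server, String encodedRegionName) {
          System.out.printf("Silently closing %s on %s%n", encodedRegionName, server);
        }

        public static void main(String[] args) {
          SilentCloseSketch master = new SilentCloseSketch();
          // The master still records the region on the old NodeA instance...
          master.recordedLocation.put("6ab0292cca294784bce8415cc69c30d4",
            "NodeA,16201,1676468874221");
          // ...but a different server instance reports it, triggering the silent close.
          master.reportOnlineRegion("NodeC,16201,1676469549542",
            "6ab0292cca294784bce8415cc69c30d4");
        }
      }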

      Our setup:

      We have a three-server cluster that runs a full HBase stack: 3 ZooKeeper nodes, an active and a standby HBase master, 3 region servers, and 3 HDFS data nodes. For reliability testing we run a script that restarts one of the three servers; each server runs a region server, a ZooKeeper process, and an HDFS process, and may also be running the active or standby HBase master.

      In this test we saw the issue after NodeB, which had been running the active master, was killed at 14:08:33, so the master switched over to NodeC. Then at 14:12:56 we saw a "STUCK Region-In-Transition" log for a region on NodeA (another common, reproducible issue for which we plan to open a separate ticket), and we restarted just the region server process on NodeA to get that region reassigned.

      Attachments

        1. config.txt
          6 kB
          Aaron Beitch

        Activity

          People

            Assignee: Unassigned
            Reporter: Aaron Beitch (aaronb)
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated: