HBase / HBASE-24368

Let HBCKSCP clear 'Unknown Servers', even if RegionStateNode has RegionLocation == null


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 3.0.0-alpha-1, 2.3.0
    • Component/s: hbck2
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      This is an incidental problem I noticed while in a hole trying to fix up a cluster. The 'obvious' remediation didn't work. This issue is about making it work.

      HBASE-23594 added filtering of Regions on the crashed server to handle the case where an assign may run concurrently with the ServerCrashProcedure (SCP). To avoid a double assign, the SCP skips the assign if the RegionStateNode's RegionLocation is not that of the crashed server.

      This is good.

      It becomes an obstacle when a Region is stuck in OPENING state, references an 'Unknown Server' (a server no longer tracked by the Master), and has no assign currently in flight. In this case, a ServerCrashProcedure scheduled to clean up the reference to the Unknown Server and get the Region reassigned skips the Region whenever the RegionStateNode in the Master has a RegionLocation that does not match the server named in the ServerCrashProcedure. That includes a RegionLocation of null, which is what we set when an assign fails, as it can when the server is no longer part of the cluster.
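
      The location check described above, and the null case that trips it, can be sketched in a few lines of plain Java. This is a simplified model, not the HBase source: RegionNode, strictMatch, and relaxedMatch are hypothetical names, server names are plain strings, and the relaxed variant is only one possible way to let an HBCKSCP accept a null RegionLocation as the issue title asks.

```java
final class ScpLocationCheckSketch {

    /** Hypothetical stand-in for the Master's in-memory RegionStateNode. */
    static final class RegionNode {
        final String encodedName;
        final String state;
        final String location; // null after a failed assign

        RegionNode(String encodedName, String state, String location) {
            this.encodedName = encodedName;
            this.state = state;
            this.location = location;
        }
    }

    /** Strict check: only handle Regions whose location is still the crashed server. */
    static boolean strictMatch(RegionNode node, String crashedServer) {
        return crashedServer.equals(node.location);
    }

    /** Relaxed check (assumption): additionally accept a Region whose location is null. */
    static boolean relaxedMatch(RegionNode node, String crashedServer) {
        return node.location == null || strictMatch(node, crashedServer);
    }

    public static void main(String[] args) {
        String crashed = "unknown_server.example.com,16020,1587577972683";
        // The stuck Region from this report: OPENING, location already nulled.
        RegionNode stuck =
            new RegionNode("1501ea3bd822c1a3e4e6216ea48733bd", "OPENING", null);
        // A made-up Region that a concurrent assign has already moved elsewhere.
        RegionNode moved =
            new RegionNode("hypothetical_region", "OPEN", "other_server.example.com,16020,1");

        System.out.println("strict  / stuck: " + strictMatch(stuck, crashed));
        System.out.println("relaxed / stuck: " + relaxedMatch(stuck, crashed));
        System.out.println("relaxed / moved: " + relaxedMatch(moved, crashed));
    }
}
```

      Under the strict check the stuck Region is skipped because null never equals the crashed server's name. The relaxed check picks it up while still skipping a Region that has a different, non-null location, so the double-assign guard stays in place for that case.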

      For background, the cluster had a RIT (Region In Transition). The RIT was a Region failing to open because of a missing Reference (another issue). The Region open would fail with a FileNotFoundException (FNFE). The Master would attempt the assign and then fail when it went to confirm OPEN, logging a complaint about the FNFE and asking for operator intervention in the Master logs.

      This state was in place for weeks on this particular cluster (a dev cluster not under close observation). The cluster had been restarted once or twice so the server the Region had once been on was no longer 'known' but it still had an entry in the hbase:meta table as last location assigned (The now 'Unknown Server').

      To fix, I went about the task in the wrong order. I bypassed the long-running stuck procedure to terminate it and clean up 'Procedures and Locks'. Mistake. Now there was no longer an assign Procedure for this Region, but I had a Region in OPENING state that referenced an unknown server, with an in-memory RegionStateNode whose RegionLocation was null (set to null on each failed assign). Running catalogjanitor_run and hbck_chore_run made the unknown server show up in the 'HBCK Report' in the 'Unknown Servers' list. Attempts at assign failed because the Region was in OPENING state; you can't assign a Region in OPENING state. Scheduling an HBCKSCP via hbck2 scheduleRecoveries always generated the following in the logs:

      org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=157217, state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; HBCKServerCrashProcedure server=unknown_server.example.com,16020,1587577972683, splitWal=true, meta=false found a region state=OPENING, location=null, table=bobby_analytics, region=1501ea3bd822c1a3e4e6216ea48733bd which is no longer on us unknown_server.example.com,16020,1587577972683, give up assigning...
      

      My workaround was to set the Region state to CLOSED with hbck2 and then do an assign with hbck2. At this point I noticed the FNFE. It would have been easier if the HBCKSCP had worked.


            People

              Assignee: stack Michael Stack
              Reporter: stack Michael Stack