HBASE-24286

HMaster won't become healthy after cloning or creating a new cluster pointing at the same file system

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 3.0.0-alpha-1, 2.2.3, 2.2.4, 2.2.5
    • Fix Version/s: None
    • Component/s: master, Region Assignment
    • Labels: None

    Description

      How to reproduce:

      1. A user starts an HBase cluster on top of a file system.
      2. The user performs some operations and shuts down the cluster; all data remains persisted in the file system.
      3. The user creates a new HBase cluster on a different set of servers, on top of the same file system and with the same root directory.
      4. HMaster cannot initialize.

      Root cause:

      During the HMaster initialization phase, the following happens:

      1. HMaster waits for the namespace table to come online.
      2. AssignmentManager loads the region info for all namespace table regions.
      3. The region servers recorded as hosting the namespace table regions are already dead, so the online check fails.
      4. HMaster keeps waiting for the namespace regions to come online, retrying up to 1000 times, which in practice means forever.

      Code waiting for namespace table to be online: https://github.com/apache/hbase/blob/rel/2.2.3/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L1102
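
The holding pattern described above can be modeled with a small, self-contained sketch. This is not the HMaster code; the class, the method names, and the new cluster's server name are hypothetical, and the real implementation also sleeps between attempts. It only illustrates why a region recorded as OPEN on a server that no longer exists never passes the online check, however many retries are allowed.

```java
import java.util.HashSet;
import java.util.Set;

public class HoldingPatternSketch {

    // Hypothetical, simplified stand-in for the region state the master reads back.
    public static final class RegionRecord {
        public final String state;
        public final String server;

        public RegionRecord(String state, String server) {
            this.state = state;
            this.server = server;
        }
    }

    // In this model a region is online only if it is OPEN on a currently live server.
    public static boolean isOnline(RegionRecord region, Set<String> liveServers) {
        return "OPEN".equals(region.state) && liveServers.contains(region.server);
    }

    // Polls up to maxAttempts times. Returns the attempt that succeeded, or -1 if none did.
    public static int waitForOnline(RegionRecord region, Set<String> liveServers, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (isOnline(region, liveServers)) {
                return attempt;
            }
            // Real code would sleep here; omitted so the sketch finishes immediately.
        }
        return -1;
    }

    public static void main(String[] args) {
        // Region state left behind by the old cluster (server name taken from the log in this issue).
        RegionRecord namespaceRegion = new RegionRecord(
            "OPEN", "ip-10-12-13-14.ec2.internal,16020,1587628031614");

        // The new cluster runs on different servers (hypothetical name).
        Set<String> liveServers = new HashSet<>();
        liveServers.add("ip-10-99-99-99.ec2.internal,16020,1587629700000");

        int result = waitForOnline(namespaceRegion, liveServers, 1000);
        System.out.println("attempt that succeeded (or -1): " + result);
    }
}
```

In this model nothing inside the loop can change the outcome: the recorded server never rejoins the live set, and, as the log line in this issue shows (ServerCrashProcedures=false), no crash procedure is scheduled to reassign the region.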

      Log output (running on S3):

      2020-04-23 08:15:57,185 WARN [master/ip-10-12-13-14:16000:becomeActiveMaster] master.HMaster: hbase:namespace,,1587628169070.d34b65b91a52644ed3e77c5fbb065c2b. is NOT online; state={d34b65b91a52644ed3e77c5fbb065c2b state=OPEN, ts=1587629742129, server=ip-10-12-13-14.ec2.internal,16020,1587628031614}; ServerCrashProcedures=false. Master startup cannot progress, in holding-pattern until region onlined.

      where ip-10-12-13-14.ec2.internal is the old region server hosting the region of hbase:namespace.
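
The server value in that log line follows the HBase server name layout of hostname, port, and start code separated by commas. The sketch below parses it with plain string handling rather than the HBase client classes, so the class and method names are hypothetical. It shows one reason the old entry cannot match a server in the new cluster: even a region server started on the same host and port would carry a different start code.

```java
public class ServerNameSketch {

    // Hypothetical holder for the three comma-separated parts of a server name.
    public static final class ParsedServerName {
        public final String host;
        public final int port;
        public final long startCode;

        public ParsedServerName(String host, int port, long startCode) {
            this.host = host;
            this.port = port;
            this.startCode = startCode;
        }
    }

    // Splits "host,port,startcode" into its parts and rejects any other shape.
    public static ParsedServerName parse(String serverName) {
        String[] parts = serverName.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected host,port,startcode: " + serverName);
        }
        return new ParsedServerName(parts[0], Integer.parseInt(parts[1]), Long.parseLong(parts[2]));
    }

    // Two names denote the same server instance only if all three parts match.
    public static boolean sameInstance(ParsedServerName a, ParsedServerName b) {
        return a.host.equals(b.host) && a.port == b.port && a.startCode == b.startCode;
    }

    public static void main(String[] args) {
        // Server recorded for the hbase:namespace region in the log line from this issue.
        ParsedServerName old = parse("ip-10-12-13-14.ec2.internal,16020,1587628031614");

        // Hypothetical restart on the same host and port: only the start code differs.
        ParsedServerName restarted = parse("ip-10-12-13-14.ec2.internal,16020,1587629700000");

        System.out.println("host=" + old.host + " port=" + old.port + " startCode=" + old.startCode);
        System.out.println("same instance: " + sameInstance(old, restarted));
    }
}
```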

      Discussion for the fix

      There is a fix for this on branch-3: https://issues.apache.org/jira/browse/HBASE-21154. Before we provide a patch, we would like to know from the community whether we should backport this change to branch-2 or apply a fix with minimal code changes.

            People

              taklwu Tak-Lon (Stephen) Wu
              jackye Jack Ye