  1. HBase
  2. HBASE-28109

NPE for the region state: Failed to become active master (HMaster)

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.17
    • Fix Version/s: 2.6.0, 2.4.18, 2.5.6, 3.0.0-beta-1
    • Component/s: master
    • Labels: None
    • Hadoop Flags: Reviewed

    Description

      When starting up an HBase cluster (2.4.17), I hit an NPE that prevented the HMaster from becoming the active master. I had to restart the HMaster to recover.

      My cluster contains 1 HMaster, 2 RS (HBase-2.4.17) and 1 Hadoop node (2.10.2).

      2023-09-18 14:17:35,931 INFO  [PEWorker-1] procedure2.ProcedureExecutor: Rolled back pid=1, state=ROLLEDBACK, exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: Operation timed out after 1.0010 sec; InitMetaProcedure table=hbase:meta exec-time=1.4660 sec
      2023-09-18 14:17:35,931 INFO  [master/hmaster:16000:becomeActiveMaster] master.HMaster: Wait for region servers to report in: status=null, state=RUNNING, startTime=1695046655931, completionTime=-1
      2023-09-18 14:17:35,932 INFO  [master/hmaster:16000:becomeActiveMaster] master.ServerManager: Waiting on regionserver count=2; waited=0ms, expecting min=1 server(s), max=NO_LIMIT server(s), timeout=4500ms, lastChange=0ms
      2023-09-18 14:17:37,438 INFO  [master/hmaster:16000:becomeActiveMaster] master.ServerManager: Waiting on regionserver count=2; waited=1505ms, expecting min=1 server(s), max=NO_LIMIT server(s), timeout=4500ms, lastChange=1505ms
      2023-09-18 14:17:38,941 INFO  [master/hmaster:16000:becomeActiveMaster] master.ServerManager: Waiting on regionserver count=2; waited=3009ms, expecting min=1 server(s), max=NO_LIMIT server(s), timeout=4500ms, lastChange=3009ms
      2023-09-18 14:17:40,445 INFO  [master/hmaster:16000:becomeActiveMaster] master.ServerManager: Finished waiting on RegionServer count=2; waited=4513ms, expected min=1 server(s), max=NO_LIMIT server(s), master is running
      2023-09-18 14:17:40,452 ERROR [master/hmaster:16000:becomeActiveMaster] master.HMaster: Failed to become active master
      java.lang.NullPointerException
              at org.apache.hadoop.hbase.master.HMaster.isRegionOnline(HMaster.java:1229)
              at org.apache.hadoop.hbase.master.HMaster.waitForMetaOnline(HMaster.java:1218)
              at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:968)
              at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2193)
              at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:528)
              at java.lang.Thread.run(Thread.java:750)
      2023-09-18 14:17:40,453 ERROR [master/hmaster:16000:becomeActiveMaster] master.HMaster: Master server abort: loaded coprocessors are: [org.apache.hadoop.hbase.quotas.MasterQuotasObserver] 

      Root Cause

      From the stack trace, the rs variable is null and is dereferenced without a null check.

      // hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java
      
        /**
         * @return True if region is online and scannable else false if an error or shutdown (Otherwise we
         *         just block in here holding up all forward-progress).
         */
        private boolean isRegionOnline(RegionInfo ri) {
          RetryCounter rc = null;
          while (!isStopped()) {
            RegionState rs = this.assignmentManager.getRegionStates().getRegionState(ri);
            // NPE line: rs can be null here, so rs.isOpened() throws
            if (rs.isOpened()) {
              if (this.getServerManager().isServerOnline(rs.getServerName())) {
                return true;
              }
            }
            // Region

      I am not sure what causes rs to be null, but we could add a null check so that this case is handled properly instead of throwing an NPE.

      Restarting the HMaster makes the exception disappear. I have attached the full HMaster log for this case. I ran into this exception on HBase 2.4.17, but I think it can also happen on the latest branch, since the code of isRegionOnline is the same.

      Fix

      This bug happens rarely. I think we can add a simple null check on rs and then decide whether to keep waiting or to shut down the HMaster directly.

      I assume that if the HMaster waits longer, it will get correct responses from the region servers.

      I have a simple PR to fix it.

      https://github.com/apache/hbase/pull/5432
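
The "keep waiting when rs is null" option can be illustrated with a small standalone sketch. This is not the patch from the PR above and it does not use HBase classes: RegionOnlineCheckSketch, its nested RegionState, the map of region states, the set of online servers, and the fixed attempt count are all hypothetical stand-ins chosen for illustration (the real method uses RegionStates and a RetryCounter).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Standalone sketch; none of these types are HBase classes.
class RegionOnlineCheckSketch {

  /** Hypothetical stand-in for RegionState: only the two fields the check reads. */
  public static final class RegionState {
    final boolean opened;
    final String serverName;

    public RegionState(boolean opened, String serverName) {
      this.opened = opened;
      this.serverName = serverName;
    }
  }

  /**
   * Polls up to maxAttempts times. A missing (null) state is treated as
   * "not online yet" and retried instead of being dereferenced.
   */
  public static boolean isRegionOnline(Map<String, RegionState> states,
      Set<String> onlineServers, String region, int maxAttempts, long sleepMillis) {
    for (int attempt = 0; attempt < maxAttempts; attempt++) {
      RegionState rs = states.get(region); // may be null before a state is recorded
      if (rs != null && rs.opened && onlineServers.contains(rs.serverName)) {
        return true;
      }
      try {
        Thread.sleep(sleepMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt and give up
        return false;
      }
    }
    return false; // caller decides whether to keep waiting or abort
  }

  public static void main(String[] args) {
    Map<String, RegionState> states = new HashMap<>();
    Set<String> onlineServers = new HashSet<>();
    onlineServers.add("rs1");

    // No state recorded yet: the check reports "not online" rather than throwing an NPE.
    System.out.println(isRegionOnline(states, onlineServers, "hbase:meta", 3, 10));

    // Once the region is opened on a live server, the check succeeds.
    states.put("hbase:meta", new RegionState(true, "rs1"));
    System.out.println(isRegionOnline(states, onlineServers, "hbase:meta", 3, 10));
  }
}
```

Returning false after a bounded number of attempts leaves the "keep waiting or shut down" decision to the caller, which mirrors the two options discussed above.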

            People

              Assignee: Ke Han (kehan5800)
              Reporter: Ke Han (kehan5800)