HBase / HBASE-28420

Aborting Active HMaster is not rejecting remote Procedure Reports


Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 2.4.17, 2.5.8
    • Fix Version/s: None
    • Component/s: master, proc-v2

    Description

      When the active HMaster is aborting and another HMaster is becoming the active HMaster, a region server that reports completion of a remote procedure at that moment generally sends the report to the old active HMaster, because it still uses the cached value of rssStub -> code (caller method). On the master side (code), the handler does check whether the service has started, but that check still returns true while the master is aborting (I could not find where this flag is set to false during abort).

      This issue becomes critical when the ServerCrashProcedure (SCP) for the RS hosting meta and a master failover happen at the same time: hbase:meta gets stuck in the offline state.
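The gap described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `MasterModel` and its methods are hypothetical stand-ins, not HBase classes and not a proposed patch. It shows how a handler that checks only a "service started" flag still accepts a report during abort, while an additional abort check would reject it, so the region server would have a reason to refresh its cached stub and retry against the new active master.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of a master RPC endpoint; this is NOT HBase code. */
public class MasterModel {
    // Assumption taken from the report: this flag stays true while aborting.
    private boolean serviceStarted = true;
    private boolean aborting = false;
    private final List<Long> acceptedPids = new ArrayList<>();

    /** Marks the master as aborting; serviceStarted is deliberately left untouched. */
    public void abort() {
        aborting = true;
    }

    /** Behaviour described in the issue: only the service-started flag is checked. */
    public boolean reportProcedureDoneCurrent(long pid) {
        if (!serviceStarted) {
            return false;
        }
        acceptedPids.add(pid);
        return true;
    }

    /** Sketch of a stricter guard: the report is also rejected while aborting. */
    public boolean reportProcedureDoneGuarded(long pid) {
        if (!serviceStarted || aborting) {
            return false;
        }
        acceptedPids.add(pid);
        return true;
    }

    public List<Long> acceptedPids() {
        return acceptedPids;
    }

    public static void main(String[] args) {
        MasterModel oldMaster = new MasterModel();
        oldMaster.abort();
        // The report is accepted even though the master is going away.
        System.out.println(oldMaster.reportProcedureDoneCurrent(3305548L));
        // With the extra check the report is rejected.
        System.out.println(oldMaster.reportProcedureDoneGuarded(3305548L));
    }
}
```

Run as written, the model prints `true` and then `false`. How a real rejection should be signalled to the region server (which exception, and how the stub is refreshed) is outside this sketch.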

      Logs for the start of the HMaster abort:

      2024-02-02 07:33:11,581 ERROR [PEWorker-6] master.HMaster - ***** ABORTING master server4-1xxx,61000,1705169084562:
      FAILED persisting region=52d36581218e00a2668776cfea897132 state=CLOSING *****
      2024-02-02 07:33:40,999 INFO [master/server4-1xxx:61000] regionserver.HRegionServer - Exiting; 
      stopping=hbase2b-mnds4-1-ia2.ops.sfdc.net,61000,1705169084562; zookeeper connection closed.

      It took almost 30 seconds for the HMaster to abort.

       

      Logs showing the SCP being scheduled for the server carrying meta (this SCP is started by the new active HMaster):

      2024-02-02 07:33:32,622 INFO [aster/server3-1xxx61000:becomeActiveMaster] assignment.AssignmentManager - Scheduled
      ServerCrashProcedure pid=3305546 for server5-1xxx61020,1706857451955 (carryingMeta=true) server5-1-
      xxx61020,1706857451955/CRASHED/regionCount=1/lock=java.util.concurrent.locks.ReentrantReadWriteLock@1b0a5293[Write 
      locks = 1, Read locks = 0], oldState=ONLINE.

      Initialization of the remote procedure:

      2024-02-02 07:33:33,178 INFO [PEWorker-4] procedure2.ProcedureExecutor - Initialized subprocedures=[{pid=3305548, 
      ppid=3305547, state=RUNNABLE; SplitWALRemoteProcedure server5-1-
      xxxxt%2C61020%2C1706857451955.meta.1706858156058.meta, worker=server4-1-xxxx,61020,1705169180881}]

      Logs of the remote procedure report being handled on the old active HMaster (server4-1xxx,61000), which was still aborting:

      2024-02-02 07:33:37,990 DEBUG [r.default.FPBQ.Fifo.handler=243,queue=9,port=61000] master.HMaster - Remote procedure 
      done, pid=3305548

      This report should have been handled by the new active HMaster, so that it could wake up the suspended procedure there. Because the new active HMaster never received the report, it could not wake the procedure up, the SCP got stuck, and meta stayed OFFLINE.
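The overlap can be checked directly from the timestamps in the log excerpts above. The following is a minimal sketch; the class and method names (`AbortWindow`, `millisBetween`, `within`) are made up for illustration, and only the timestamps come from the logs.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper that compares the log timestamps quoted in this issue. */
public class AbortWindow {
    // Matches the timestamp layout used in the log lines, e.g. 2024-02-02 07:33:11,581
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    public static LocalDateTime parse(String ts) {
        return LocalDateTime.parse(ts, FMT);
    }

    /** Milliseconds elapsed from one log timestamp to another. */
    public static long millisBetween(String from, String to) {
        return Duration.between(parse(from), parse(to)).toMillis();
    }

    /** True if ts lies inside the closed interval [start, end]. */
    public static boolean within(String ts, String start, String end) {
        LocalDateTime t = parse(ts);
        return !t.isBefore(parse(start)) && !t.isAfter(parse(end));
    }

    public static void main(String[] args) {
        String abortStart = "2024-02-02 07:33:11,581";    // ABORTING logged on old master
        String abortEnd = "2024-02-02 07:33:40,999";      // "Exiting" logged on old master
        String scpScheduled = "2024-02-02 07:33:32,622";  // SCP scheduled by new master
        String reportHandled = "2024-02-02 07:33:37,990"; // "Remote procedure done" on old master

        System.out.println("abort duration ms: " + millisBetween(abortStart, abortEnd));
        System.out.println("report inside abort window: "
            + within(reportHandled, abortStart, abortEnd));
        System.out.println("report after SCP scheduled, ms: "
            + millisBetween(scpScheduled, reportHandled));
    }
}
```

With these inputs the abort lasts 29418 ms, and the report arrives 5368 ms after the new master scheduled the SCP, while the old master was still inside its abort window.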

       

      Logs of the new HMaster trying to become the active master but getting stuck:

      2024-02-02 07:33:43,159 WARN [aster/server3-1-ia2:61000:becomeActiveMaster] master.HMaster - hbase:meta,,1.1588230740 
      is NOT online; state={1588230740 state=OPEN, ts=1706859212481, server=server5-1-xxx,61020,1706857451955}; 
      ServerCrashProcedures=true. Master startup cannot progress, in holding-pattern until region onlined.

      After this, the master stayed stuck until we performed an HMaster failover to recover.

          People

            umesh9414 Umesh Kumar Kumawat