HBase / HBASE-21307

[amv2] Deadlock when we move a Region from a not-online RegionServer

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.1.1
    • Component/s: amv2
    • Labels: None

    Description

      Perhaps this doesn't happen in branch-2, but it's a problem in branch-2.1.

      At a high level: we go to move a region, and its unassign subprocedure fails its dispatch because the server is not online, so it queues an SCP (ServerCrashProcedure) and waits on it to break the RPC. The SCP can't run, though, because the MRP (MoveRegionProcedure) holds the lock on the region.
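
      The circular wait described above can be pictured as a small wait-for graph. This is only a toy sketch, not HBase code: the class `WaitForCycle` and its methods are made up for illustration, and the three edges are my reading of the description (parent waits on child, child waits on the SCP, SCP waits on the region lock held by the parent).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy wait-for graph. Illustrative only; none of this is HBase API. */
class WaitForCycle {
    // Each waiter maps to the single thing it is blocked on.
    private final Map<String, String> waitsOn = new HashMap<>();

    void addWait(String waiter, String blocker) {
        waitsOn.put(waiter, blocker);
    }

    /** Follows the wait chain from start; returns true if it revisits a node. */
    boolean hasCycleFrom(String start) {
        Set<String> seen = new HashSet<>();
        String current = start;
        while (current != null) {
            if (!seen.add(current)) {
                return true; // already visited: the chain loops
            }
            current = waitsOn.get(current);
        }
        return false; // chain ended at something that is not waiting
    }

    /** Edges as described in this issue (an assumption drawn from the text). */
    static WaitForCycle reportedScenario() {
        WaitForCycle g = new WaitForCycle();
        // The MRP waits for its unassign subprocedure to finish.
        g.addWait("MoveRegionProcedure", "UnassignProcedure");
        // The unassign failed dispatch and waits on the queued SCP.
        g.addWait("UnassignProcedure", "ServerCrashProcedure");
        // The SCP needs the region lock that the MRP still holds.
        g.addWait("ServerCrashProcedure", "MoveRegionProcedure");
        return g;
    }

    public static void main(String[] args) {
        System.out.println(reportedScenario().hasCycleFrom("MoveRegionProcedure"));
    }
}
```

      Dropping any one of the three edges (for example, releasing the region lock while the parent waits) breaks the cycle in this model.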

      I can bypass the MRP, but then the SCP fails because the Region is still 'owned' by the MRP. See below:

      2018-10-12 16:29:53,423 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Begin bypass pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726, region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, server=va1002.halxg.cloudera.com,22101,1539368318649 with lockWait=0, override=true, recursive=true
      2018-10-12 16:29:53,424 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true; UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726, region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, server=va1002.halxg.cloudera.com,22101,1539368318649
      2018-10-12 16:29:53,712 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411981, state=WAITING:MOVE_REGION_ASSIGN, locked=true; MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568, source=va1002.halxg.cloudera.com,22101,1539368318649, destination=vd1021.halxg.cloudera.com,22101,1539368317897
      2018-10-12 16:29:53,838 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Bypassing pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true, bypass=LOG-REDACTED UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726, region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, server=va1002.halxg.cloudera.com,22101,1539368318649 and its ancestors successfully, adding to queue
      2018-10-12 16:29:53,839 INFO org.apache.hadoop.hbase.procedure2.Procedure: pid=411982, ppid=411981, state=RUNNABLE:REGION_TRANSITION_DISPATCH, locked=true, bypass=LOG-REDACTED UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726, region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, server=va1002.halxg.cloudera.com,22101,1539368318649 bypassed, returning null to finish it
      2018-10-12 16:29:53,954 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished subprocedure pid=411982, resume processing parent pid=411981, state=RUNNABLE:MOVE_REGION_ASSIGN, locked=true, bypass=LOG-REDACTED MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568, source=va1002.halxg.cloudera.com,22101,1539368318649, destination=vd1021.halxg.cloudera.com,22101,1539368317897
      2018-10-12 16:29:53,954 INFO org.apache.hadoop.hbase.procedure2.Procedure: pid=411981, state=RUNNABLE:MOVE_REGION_ASSIGN, locked=true, bypass=LOG-REDACTED MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568, source=va1002.halxg.cloudera.com,22101,1539368318649, destination=vd1021.halxg.cloudera.com,22101,1539368317897 bypassed, returning null to finish it
      2018-10-12 16:29:53,956 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=411982, ppid=411981, state=SUCCESS, bypass=LOG-REDACTED UnassignProcedure table=IntegrationTestBigLinkedList_20180709093726, region=f5f9ff1e4b0f2d9555dabfcca71df568, override=true, server=va1002.halxg.cloudera.com,22101,1539368318649 in 3hrs, 49mins, 12.419sec, unfinishedSiblingCount=0
      2018-10-12 16:29:54,058 INFO org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=411981, state=SUCCESS, bypass=LOG-REDACTED MoveRegionProcedure hri=f5f9ff1e4b0f2d9555dabfcca71df568, source=va1002.halxg.cloudera.com,22101,1539368318649, destination=vd1021.halxg.cloudera.com,22101,1539368317897 in 3hrs, 49mins, 12.878sec
      2018-10-12 16:29:54,059 INFO org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: xlock for pid=412210, ppid=411983, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=IntegrationTestBigLinkedList_20180709093726, region=f5f9ff1e4b0f2d9555dabfcca71df568
      2018-10-12 16:29:54,105 WARN org.apache.hadoop.hbase.master.assignment.RegionTransitionProcedure: f5f9ff1e4b0f2d9555dabfcca71df568 owned by pid=411982, CANNOT run 'this' (pid=412210).
      ....
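
      One way to read the final WARN line: pid=411982 is reported as finished (state=SUCCESS) by the bypass, yet the region still lists it as owner, so pid=412210 is refused. The sketch below is a toy model of that reading, assuming the bypass path skips the step that clears the owner. It is not HBase's implementation; `RegionOwnership`, `tryRun`, `finish`, and `bypassWithoutRelease` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of per-region procedure ownership. Not HBase code. */
class RegionOwnership {
    // Encoded region name -> pid of the procedure that currently owns it.
    private final Map<String, Long> ownerByRegion = new HashMap<>();

    /** Returns true if pid may run on the region, claiming it when unowned. */
    boolean tryRun(String region, long pid) {
        Long owner = ownerByRegion.putIfAbsent(region, pid);
        return owner == null || owner == pid;
    }

    /** Normal completion: the owner entry is cleared if it belongs to pid. */
    void finish(String region, long pid) {
        ownerByRegion.remove(region, pid);
    }

    /** Assumed bypass behaviour: the procedure ends but the owner entry stays. */
    void bypassWithoutRelease(String region, long pid) {
        // Intentionally does nothing to ownerByRegion.
    }

    public static void main(String[] args) {
        RegionOwnership regions = new RegionOwnership();
        String region = "f5f9ff1e4b0f2d9555dabfcca71df568";
        regions.tryRun(region, 411982L);               // unassign claims the region
        regions.bypassWithoutRelease(region, 411982L); // bypass ends it, owner remains
        System.out.println(regions.tryRun(region, 412210L)); // later assign is refused
    }
}
```

      In this model the later procedure can only run once the stale owner entry is cleared, which is what `finish` does on the normal path.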
      

          People

            Assignee: stack Michael Stack
            Reporter: stack Michael Stack