  1. HBase
  2. HBASE-14129

If any regionserver is shut down uncleanly during a full cluster restart, locality appears to be lost


Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Won't Fix

    Description

      We did a full cluster restart the other day, and some regionservers did not shut down cleanly. After the restart our locality dropped from 99% to 5%. Looking at the code, AssignmentManager.joinCluster() calls AssignmentManager.processDeadServersAndRegionsInTransition().
      If the failover flag gets set for any reason, it seems we don't call assignAllUserRegions(). It then looks like the balancer does the work of assigning those regions. We don't use a locality-aware balancer, so we lost our region locality.

      I don't have a solid grasp of the reasoning behind these checks, but there are some potential workarounds:

      1. After shutting down your cluster, move your WALs aside (replay later).
      2. Clean up your znodes.

      That seems to work but requires a lot of manual labor. Another solution, which I prefer, would be to add a flag to the start script: ./start-hbase.sh --clean

      If the master is started with that flag, we check for it in AssignmentManager.processDeadServersAndRegionsInTransition() and, when it is set, call assignAllUserRegions() regardless of the failover state.

      I have a patch for the latter solution, assuming I am understanding the logic correctly.
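
      To make the proposal concrete, here is a minimal sketch of the decision as I understand it. This is not HBase source and not the attached patch: the class, the enum, and the cleanStart parameter are hypothetical names that only model the branch described above.

```java
// Simplified model of the master startup decision described in this issue.
// NOT HBase code: every name in this file is a hypothetical stand-in.
final class StartupAssignmentSketch {

    enum Strategy {
        // Stands in for calling assignAllUserRegions().
        ASSIGN_ALL_USER_REGIONS,
        // Stands in for the failover path, where assignAllUserRegions() is skipped.
        FAILOVER_RECOVERY
    }

    // failover:   true if startup detected a failover (e.g. an unclean regionserver shutdown).
    // cleanStart: true if the operator passed the proposed --clean flag (hypothetical).
    static Strategy chooseStrategy(boolean failover, boolean cleanStart) {
        if (cleanStart) {
            // Proposed behavior: the operator's flag overrides the failover state.
            return Strategy.ASSIGN_ALL_USER_REGIONS;
        }
        // Behavior as observed: a set failover flag skips the bulk assignment.
        return failover ? Strategy.FAILOVER_RECOVERY : Strategy.ASSIGN_ALL_USER_REGIONS;
    }

    public static void main(String[] args) {
        System.out.println("failover, no flag: " + chooseStrategy(true, false));
        System.out.println("failover, --clean: " + chooseStrategy(true, true));
        System.out.println("clean startup:     " + chooseStrategy(false, false));
    }
}
```

      The only change the flag makes is in the failover case; a startup that is already clean behaves the same with or without it.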

      Attachments

        1. HBASE-14129.patch
          4 kB
          churro morales


            People

              Assignee: Unassigned
              Reporter: churro morales
              Votes: 0
              Watchers: 10
