  1. HBase
  2. HBASE-23693

Split failure may cause a region hole and data loss when using ZK assignment


Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 1.4.8
    • 1.6.0
    • master
    • None
    • split

    Description

      To simulate this case, I added a sleep in SplitTransactionImpl.execute, after the PONR (point of no return) and before openDaughters:

      public PairOfSameType<Region> execute(final Server server,
            final RegionServerServices services, User user) throws IOException {
          this.server = server;
          this.rsServices = services;
          useZKForAssignment = server == null ? true :
            ConfigUtil.useZKForAssignment(server.getConfiguration());
          if (useCoordinatedStateManager(server)) {
            std =
                ((BaseCoordinatedStateManager) server.getCoordinatedStateManager())
                    .getSplitTransactionCoordination().getDefaultDetails();
          }
          PairOfSameType<Region> regions = createDaughters(server, services, user);
          if (this.parent.getCoprocessorHost() != null) {
            if (user == null) {
              parent.getCoprocessorHost().preSplitAfterPONR();
            } else {
              try {
                user.getUGI().doAs(new PrivilegedExceptionAction<Void>() {
                  @Override
                  public Void run() throws Exception {
                    parent.getCoprocessorHost().preSplitAfterPONR();
                    return null;
                  }
                });
              } catch (InterruptedException ie) {
                InterruptedIOException iioe = new InterruptedIOException();
                iioe.initCause(ie);
                throw iioe;
              }
            }
          }
          
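          // Injected for reproduction only: stall for one hour after the PONR,
          // before the daughters are opened, so the RS can be killed in this window.
          try {
            Thread.sleep(1000 * 60 * 60);
          } catch (InterruptedException e) {
            e.printStackTrace();
          }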
      
          regions = stepsAfterPONR(server, services, regions, user);
      
          transition(SplitTransactionPhase.COMPLETED);
      
          return regions;
        }
      

      With this change the split transaction hangs after the PONR.

      Then I tried to reproduce the problem:

      1. Create a test table and move it into a test rsgroup that contains only one RS.

      2. Trigger a region split.

      3. The split transaction passes the PONR and then sleeps; the regioninfo in meta has already been updated.

      4. Kill the RS process to simulate a machine crash.

      5. ServerCrashProcedure cleans up the SPLITTING_NEW regions, so the daughter regions are deleted.

      6. ServerCrashProcedure tries to assign the parent region. Because the RS is down, the assign fails, the region state is set to FAILED_OPEN, and the region is put back into regionsInTransition. At this point, because of the RS crash, the region's node under the ZK region-in-transition path no longer exists.

      7. The CatalogJanitor thread is blocked because of the RIT.

      8. Switch the active master.

      9. The CatalogJanitor thread on the new master runs normally and cleans up the parent region, because split = true && offline = true in the meta table.

      10. The test table now has a hole, and data is lost.
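
      To make the "hole" in step 10 concrete, here is a minimal sketch of how a gap in a table's region chain can be detected. It is not HBase code: RegionHoleCheck, KeyRange and findHoles are hypothetical names, keys are modeled as plain Strings rather than byte arrays, and the example key ranges ("g", "p") are made up for illustration. It also assumes regions do not overlap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of HBase: finds gaps in a table's key space.
final class RegionHoleCheck {

    /** Stand-in for a region's key range; "" means an unbounded start or end. */
    static final class KeyRange {
        final String startKey;
        final String endKey;

        KeyRange(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** Returns every uncovered range as "[start, end)"; an empty end means end of table. */
    static List<String> findHoles(List<KeyRange> regions) {
        List<KeyRange> sorted = new ArrayList<>(regions);
        sorted.sort(Comparator.comparing((KeyRange r) -> r.startKey));

        List<String> holes = new ArrayList<>();
        String covered = "";        // all keys below this value are covered so far
        boolean reachedEnd = false; // true once a region with an unbounded end is seen

        for (KeyRange r : sorted) {
            if (reachedEnd) {
                break;
            }
            if (r.startKey.compareTo(covered) > 0) {
                holes.add("[" + covered + ", " + r.startKey + ")");
            }
            if (r.endKey.isEmpty()) {
                reachedEnd = true;
            } else if (r.endKey.compareTo(covered) > 0) {
                covered = r.endKey;
            }
        }
        if (!reachedEnd) {
            holes.add("[" + covered + ", )");
        }
        return holes;
    }

    public static void main(String[] args) {
        // Assumed layout: the parent region covered [g, p). After the failure both
        // the parent and its daughters are gone, so only the neighbours remain.
        List<KeyRange> afterFailure = new ArrayList<>();
        afterFailure.add(new KeyRange("", "g"));
        afterFailure.add(new KeyRange("p", ""));
        System.out.println(findHoles(afterFailure));
    }
}
```

      In this model, rows whose keys fall in the reported range are no longer served by any region, which is the data loss described above.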

       

      I modified ServerCrashProcedure so that, when it cleans up the daughter regions, it also updates the parent regioninfo in the meta table. With this change the problem no longer reproduces.
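
      The interaction between the crash cleanup and CatalogJanitor can be illustrated with a toy model. This is a sketch only, not the attached patch and not HBase code: SplitCleanupModel and its methods are hypothetical, "updating the parent regioninfo" is assumed here to mean clearing the split and offline flags, and "no daughter still references the parent" is simplified to "no daughters are left".

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the scenario in this report; all names are hypothetical.
final class SplitCleanupModel {
    boolean parentSplit;    // models split = true on the parent's meta row
    boolean parentOffline;  // models offline = true on the parent's meta row
    final Set<String> daughters = new HashSet<>();

    /** State after the split has passed the PONR: parent marked split and offline. */
    void splitPastPonr() {
        parentSplit = true;
        parentOffline = true;
        daughters.add("daughterA");
        daughters.add("daughterB");
    }

    /**
     * Crash cleanup of the half-opened daughters.
     * restoreParent = false models the behaviour described in the report;
     * restoreParent = true models the assumed fix (parent flags are reset).
     */
    void crashCleanup(boolean restoreParent) {
        daughters.clear();
        if (restoreParent) {
            parentSplit = false;
            parentOffline = false;
        }
    }

    /** Simplified janitor rule: a split, offline parent that nothing references is removed. */
    boolean janitorWouldDeleteParent() {
        return parentSplit && parentOffline && daughters.isEmpty();
    }

    public static void main(String[] args) {
        SplitCleanupModel before = new SplitCleanupModel();
        before.splitPastPonr();
        before.crashCleanup(false);
        System.out.println("without parent update, parent deleted: "
            + before.janitorWouldDeleteParent());

        SplitCleanupModel after = new SplitCleanupModel();
        after.splitPastPonr();
        after.crashCleanup(true);
        System.out.println("with parent update, parent deleted: "
            + after.janitorWouldDeleteParent());
    }
}
```

      In the model, leaving the parent marked as split and offline after its daughters are removed is what makes the parent eligible for cleanup; resetting the parent's state in the same cleanup step closes that window.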

      I will upload the patch later.

      Attachments

        1. HBASE-23693.branch-1.001.patch
          8 kB
          tianhang tang

          People

            tangtianhang tianhang tang
            tangtianhang tianhang tang