  1. HBase
  2. HBASE-21920

Ignoring 'empty' end_key while calculating end_key for new region in HBCK -fixHdfsOverlaps command can cause data loss

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4.10, 1.3.5
    • Component/s: hbck
    • Labels: None

    Description

      When the -fixHdfsOverlaps command is run on a table with overlapping regions, it moves all the HFiles of the overlapping regions into a new region. The start_key and end_key of that new region are calculated from the minimum start_key and the maximum end_key of all the overlapping regions.

      When the start_key and end_key of the new region are calculated, an 'empty' end_key is not considered, which leads to data loss when the table is scanned using 'startrow'.

      For example:
      1. Create table 't'.
      2. Insert records {00,111,200} into table 't' and flush the data.
      3. Split table 't' with split key '100'.
      4. Now we have three regions (one parent and two daughter regions):
         1. Region-1('Empty','Empty') => {00,111,200}
         2. Region-2('Empty','100') => {00}
         3. Region-3('100','Empty') => {111,200}
      5. Make sure the parent region is not deleted from the file system, then run the -fixHdfsOverlaps command.

      The -fixHdfsOverlaps command will move all the HFiles of the three regions {Region-1, Region-2, Region-3} into a new region (Region-4) created with start_key='Empty' and end_key='100'.

      This is because it does not consider end_key='Empty' and instead treats end_key='100' as the maximum. As a result, all the HFiles of the three regions are moved into the new region even though some records in those HFiles sort after end_key='100'. One empty region, Region-5('100','Empty'), will also be created because the end key of the table's last region was not empty.
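The calculation described above can be modelled with a small sketch. This is not the HBaseFsck code; `KeyRange`, `compare`, and `naiveMaxEndKey` are hypothetical names, and the sketch only assumes that end keys are compared in plain byte order, where an empty key sorts before every other key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class OverlapRangeSketch {
    /** Hypothetical stand-in for a region's key range; not an HBase class. */
    static final class KeyRange {
        final byte[] start;
        final byte[] end;

        KeyRange(String start, String end) {
            this.start = start.getBytes(StandardCharsets.UTF_8);
            this.end = end.getBytes(StandardCharsets.UTF_8);
        }
    }

    /** Unsigned lexicographic comparison of two row keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /** Largest end key by plain byte order: an empty end key can never win. */
    static byte[] naiveMaxEndKey(List<KeyRange> regions) {
        byte[] max = new byte[0];
        for (KeyRange r : regions) {
            if (compare(r.end, max) > 0) {
                max = r.end;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // Region-1, Region-2 and Region-3 from the example above.
        List<KeyRange> overlapping = Arrays.asList(
            new KeyRange("", ""), new KeyRange("", "100"), new KeyRange("100", ""));
        byte[] end = naiveMaxEndKey(overlapping);
        System.out.println(new String(end, StandardCharsets.UTF_8)); // prints 100
    }
}
```

With the three regions from the example, the sketch picks '100' as the end key of the merged region, even though two of the regions extend to the end of the table.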

      Now we have 2 regions:

      1. Region-4(Empty,100) => {00,111,200}

      2. Region-5(100,Empty) => {}

      When a scan of the entire table is done, all the records are returned and there is no apparent data loss. When a scan with a start row is done, however, the results are as follows:

      1. scan 't', { STARTROW => '00'} => {00,111,200}

      2. scan 't', { STARTROW => '100'} => {}

      The second scan returns an empty result because it looks for the rows in Region-5(100,Empty), which contains no records, while the records {111,200} are actually present in Region-4(Empty,100).
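The two scan results can be reproduced with a simplified model of how a scan picks its first region. This is only an illustration, not the HBase client: `regionsAfterRepair` and `scanFrom` are hypothetical names, regions are modelled as a map from start key to the rows physically stored there, and a scan is assumed to begin in the region with the greatest start key that is not larger than the start row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

public class RegionLookupSketch {
    /** Region start key -> rows stored in that region after the repair (hypothetical model). */
    static TreeMap<String, List<String>> regionsAfterRepair() {
        TreeMap<String, List<String>> regions = new TreeMap<>();
        regions.put("", Arrays.asList("00", "111", "200"));  // Region-4(Empty,100)
        regions.put("100", Collections.<String>emptyList()); // Region-5(100,Empty)
        return regions;
    }

    /** Rows returned by a scan that starts at startRow in this simplified model. */
    static List<String> scanFrom(TreeMap<String, List<String>> regions, String startRow) {
        List<String> out = new ArrayList<>();
        // The scan starts in the region whose start key is the greatest one <= startRow
        // and then continues through the regions that follow it.
        String firstRegion = regions.floorKey(startRow);
        for (List<String> rows : regions.tailMap(firstRegion, true).values()) {
            for (String row : rows) {
                if (row.compareTo(startRow) >= 0) {
                    out.add(row);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, List<String>> regions = regionsAfterRepair();
        System.out.println(scanFrom(regions, "00"));  // prints [00, 111, 200]
        System.out.println(scanFrom(regions, "100")); // prints []
    }
}
```

In this model the scan from '100' never visits Region-4, so rows 111 and 200 are unreachable even though they still exist on disk.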

      The problem exists only when end_key='Empty' is present in any of the overlapping regions. I think that if an 'Empty' end_key is present in any of the overlapping regions, we have to consider it the maximum end_key.
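The proposal can be sketched as follows. This is a minimal illustration of the suggested rule, not the attached patch: `maxEndKey` and `compare` are hypothetical helpers, and the only assumption is that an empty end key means "end of table" and should therefore beat every other end key.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class MergedEndKeySketch {
    /** Unsigned lexicographic comparison of two row keys. */
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    /**
     * Maximum end key of the overlapping regions, treating an empty end key
     * as the end of the table. Returns null when the list is empty.
     */
    static byte[] maxEndKey(List<byte[]> endKeys) {
        byte[] max = null;
        for (byte[] end : endKeys) {
            if (end.length == 0) {
                return end; // an empty end key is unbounded, so nothing can exceed it
            }
            if (max == null || compare(end, max) > 0) {
                max = end;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // End keys of Region-1, Region-2 and Region-3 from the example.
        List<byte[]> endKeys = Arrays.asList(
            new byte[0], "100".getBytes(StandardCharsets.UTF_8), new byte[0]);
        System.out.println(maxEndKey(endKeys).length); // prints 0, i.e. the empty end key
    }
}
```

Under this rule the merged region in the example would span ('Empty','Empty'), so no empty trailing region would be needed and a scan starting at '100' would reach rows 111 and 200.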

      Attachments

        1. HBASE-21920.branch-1.patch
          2 kB
          Syeda Arshiya Tabreen
        2. HBASE-21920.branch-1.001.patch
          4 kB
          Syeda Arshiya Tabreen
        3. HBASE-21920.branch-1.002.patch
          4 kB
          Syeda Arshiya Tabreen
        4. HBASE-21920.branch-1.002.patch
          4 kB
          Syeda Arshiya Tabreen
        5. HBASE-21920.branch-1.002.patch
          4 kB
          Syeda Arshiya Tabreen
        6. HBASE-21920.branch-1.002.patch
          4 kB
          Toshihiro Suzuki

            People

              arshiya9414 Syeda Arshiya Tabreen