Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-23084

Out-of-order adjacent keys causes regionserver crash

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • None
    • None
    • regionserver
    • None

    Description

      when auto flush and manual flush is invoked,  cell in immutable memstore segment is scanned one by one to build data block and bloom filter block, during this time, two adjacent keys are compared to assert that they are in lexicographical order, but regretfully, the assertion fails in some scenarios, when RS encounter these assertion failure, always call abort function to terminate abnormally.

       

      the version is 0ba7cc01a13dbb7fec8e6c5d599fc8b4936bae61  tag: rel/2.1.5

       

      At least three cases are found.

      error case#1:  getQualifierArray throws java.lang.ArrayIndexOutOfBoundsException

       

      17:57:50.328 [MemStoreFlusher.0] ERROR org.apache.hadoop.hbase.regionserver.HRegionServer - ***** ABORTING region server ${regionserver}: Replay of WAL required. Forcing server shutdown *****
      org.apache.hadoop.hbase.DroppedSnapshotException: region: ${region}.
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2788) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2467) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2439) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2329) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:612) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:581) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:68) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:361) [hbase-server-2.1.5.jar:2.1.5]
              at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
      Caused by: java.lang.ArrayIndexOutOfBoundsException
              at org.apache.hadoop.hbase.util.ByteBufferUtils.copyFromBufferToArray(ByteBufferUtils.java:1155) ~[hbase-common-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.CellUtil.copyQualifierTo(CellUtil.java:309) ~[hbase-common-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.CellUtil.cloneQualifier(CellUtil.java:121) ~[hbase-common-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.ByteBufferKeyValue.getQualifierArray(ByteBufferKeyValue.java:112) ~[hbase-common-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.CellUtil.getCellKeyAsString(CellUtil.java:1306) ~[hbase-common-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.CellUtil.toString(CellUtil.java:1483) ~[hbase-common-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.ByteBufferKeyValue.toString(ByteBufferKeyValue.java:296) ~[hbase-common-2.1.5.jar:2.1.5]
              at java.lang.String.valueOf(String.java:2994) ~[?:1.8.0_144]
              at java.lang.StringBuilder.append(StringBuilder.java:131) ~[?:1.8.0_144]
              at org.apache.hadoop.hbase.regionserver.StoreScanner.checkScanOrder(StoreScanner.java:1007) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:582) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:127) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:69) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1021) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2259) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2733) ~[hbase-server-2.1.5.jar:2.1.5]
              ... 8 more
      17:57:50.332 [MemStoreFlusher.0] ERROR org.apache.hadoop.hbase.regionserver.HRegionServer - RegionServer abort: loaded coprocessors are: [org.apache.phoenix.coprocessor.ScanRegionObserver, org.apache.phoenix.coprocessor.UngroupedAggregateRegionObserver, org.apache.phoenix.hbase.index.Indexer, org.apache.phoenix.coprocessor.GroupedAggregateRegionObserver, org.apache.phoenix.coprocessor.ServerCachingEndpointImpl]
        
      

       

       

      error case#2:  StoreScanner.checkScanOrder triggers "prevKey followed by a smaller key"

       

      Exception in thread "RS_IN_MEMORY_COMPACTION-regionserver/${regionserver}" java.lang.AssertionError: Key ${prevKey} followed by a smaller key ${key}
              at org.apache.hadoop.hbase.regionserver.StoreScanner.checkScanOrder(StoreScanner.java:1007)
              at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:582)
              at org.apache.hadoop.hbase.regionserver.MemStoreCompactorSegmentsIterator.refillKVS(MemStoreCompactorSegmentsIterator.java:150)
              at org.apache.hadoop.hbase.regionserver.MemStoreCompactorSegmentsIterator.hasNext(MemStoreCompactorSegmentsIterator.java:72)
              at org.apache.hadoop.hbase.regionserver.CellChunkImmutableSegment.initializeCellSet(CellChunkImmutableSegment.java:146)
              at org.apache.hadoop.hbase.regionserver.CellChunkImmutableSegment.<init>(CellChunkImmutableSegment.java:66)
              at org.apache.hadoop.hbase.regionserver.SegmentFactory.createImmutableSegment(SegmentFactory.java:133)
              at org.apache.hadoop.hbase.regionserver.SegmentFactory.createImmutableSegmentByCompaction(SegmentFactory.java:59)
              at org.apache.hadoop.hbase.regionserver.MemStoreCompactor.createSubstitution(MemStoreCompactor.java:217)
              at org.apache.hadoop.hbase.regionserver.MemStoreCompactor.doCompaction(MemStoreCompactor.java:171)
              at org.apache.hadoop.hbase.regionserver.MemStoreCompactor.start(MemStoreCompactor.java:109)
              at org.apache.hadoop.hbase.regionserver.CompactingMemStore.flushInMemory(CompactingMemStore.java:422)
              at org.apache.hadoop.hbase.regionserver.CompactingMemStore$InMemoryFlushRunnable.run(CompactingMemStore.java:534)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at java.lang.Thread.run(Thread.java:748)
              at java.lang.Thread.run(Thread.java:748)
      12:34:07.695 [MemStoreFlusher.0] INFO  org.apache.hadoop.hbase.regionserver.HRegion - Flushing 1/1 column families, dataSize=239.26 MB heapSize=256.00 MB
      12:34:08.476 [MemStoreFlusher.0] ERROR org.apache.hadoop.hbase.regionserver.HRegionServer - ***** ABORTING region server ${regionserver}: Replay of WAL required. Forcing server shutdown *****
      org.apache.hadoop.hbase.DroppedSnapshotException: region: ${region}.
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2788) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2467) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2439) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2329) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:612) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:581) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$1000(MemStoreFlusher.java:68) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:361) [hbase-server-2.1.5.jar:2.1.5]
              at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
      Caused by: java.lang.AssertionError: Key ${prevKey} followed by a smaller key ${key}
              at org.apache.hadoop.hbase.regionserver.StoreScanner.checkScanOrder(StoreScanner.java:1007) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:582) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:127) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:69) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1021) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2259) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2733) ~[hbase-server-2.1.5.jar:2.1.5]
              ... 8 more
      
      
      

       

       

       

      error case#3: BloomContext.sanityCheck triggers "Added a key not lexically larger than previous." 

      so, I alter the assertion code into if stmt

      --- a/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java
      +++ b/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java
      @@ -33,6 +33,8 @@ import org.apache.hadoop.hbase.io.util.StreamUtils;
       import org.apache.hadoop.io.IOUtils;
       import org.apache.hadoop.io.WritableUtils;
       import org.apache.yetus.audience.InterfaceAudience;
      +import org.slf4j.Logger;
      +import org.slf4j.LoggerFactory;
       import sun.nio.ch.DirectBuffer;
       
       import org.apache.hbase.thirdparty.com.google.common.annotations.VisibleForTesting;
      @@ -51,6 +53,7 @@ public final class ByteBufferUtils {
         @VisibleForTesting
         final static boolean UNSAFE_AVAIL = UnsafeAvailChecker.isAvailable();
         public final static boolean UNSAFE_UNALIGNED = UnsafeAvailChecker.unaligned();
      +  private static Logger LOG = LoggerFactory.getLogger(ByteBufferUtils.class);
       
         private ByteBufferUtils() {
         }
      @@ -1152,7 +1155,18 @@ public final class ByteBufferUtils {
         public static void copyFromBufferToArray(byte[] out, ByteBuffer in, int sourceOffset,
             int destinationOffset, int length) {
           if (in.hasArray()) {
      -      System.arraycopy(in.array(), sourceOffset + in.arrayOffset(), out, destinationOffset, length);
      +      try {
      +        System.arraycopy(in.array(), sourceOffset + in.arrayOffset(), out, destinationOffset, length);
      +      } catch (Throwable e){
      +        if (out == null) {
      +          LOG.warn("[QINGGUI] out=null");
      +        } else if (in == null || in.array() == null) {
      +          LOG.warn("[QINGGUI] in=null");
      +        } else {
      +          LOG.warn("[QINGGUI] out.length={}, in.length={}, sourceOffset={} destinationOffset={} length={}",
      +            out.length, in.array().length, sourceOffset, destinationOffset, length);
      +        }
      +      }
           } else if (UNSAFE_AVAIL) {
             UnsafeAccess.copy(in, sourceOffset, out, destinationOffset, length);
           } else {
      
      

       

      re-deploy my binary, then the failure drifts to bloom filter building code.

      12:54:39.966 [rs(${regionserver})-flush-proc-pool3-thread-2] WARN  org.apache.hadoop.hbase.regionserver.HStore - Failed flushing store file, retrying num=9
      java.io.IOException: Added a key not lexically larger than previous. Current cell ...
              at org.apache.hadoop.hbase.util.BloomContext.sanityCheck(BloomContext.java:63) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.util.BloomContext.writeBloom(BloomContext.java:54) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.StoreFileWriter.appendGeneralBloomfilter(StoreFileWriter.java:213) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.StoreFileWriter.append(StoreFileWriter.java:232) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:133) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:69) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1021) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2259) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2733) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2467) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2439) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2329) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2232) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.procedure.flush.FlushTableSubprocedure$RegionFlushTask.call(FlushTableSubprocedure.java:68) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.procedure.flush.FlushTableSubprocedure$RegionFlushTask.call(FlushTableSubprocedure.java:56) [hbase-server-2.1.5.jar:2.1.5]
              at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_144]
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_144]
              at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_144]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
              at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
      12:54:39.979 [rs(${regionserver})-flush-proc-pool3-thread-2] ERROR org.apache.hadoop.hbase.regionserver.HRegionServer - ***** ABORTING region server ${regionserver}: Replay of WAL required. Forcing server shutdown *****
      org.apache.hadoop.hbase.DroppedSnapshotException: region: ${region}.
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2788) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2467) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushcache(HRegion.java:2439) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.flushcache(HRegion.java:2329) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.flush(HRegion.java:2232) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.procedure.flush.FlushTableSubprocedure$RegionFlushTask.call(FlushTableSubprocedure.java:68) [hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.procedure.flush.FlushTableSubprocedure$RegionFlushTask.call(FlushTableSubprocedure.java:56) [hbase-server-2.1.5.jar:2.1.5]
              at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_144]
              at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_144]
              at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_144]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
              at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
      Caused by: java.io.IOException: Added a key not lexically larger than previous. Current cell = ...
              at org.apache.hadoop.hbase.util.BloomContext.sanityCheck(BloomContext.java:63) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.util.BloomContext.writeBloom(BloomContext.java:54) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.StoreFileWriter.appendGeneralBloomfilter(StoreFileWriter.java:213) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.StoreFileWriter.append(StoreFileWriter.java:232) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:133) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:69) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:1021) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2259) ~[hbase-server-2.1.5.jar:2.1.5]
              at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2733) ~[hbase-server-2.1.5.jar:2.1.5]
              ... 12 more
      
      

       

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              satanson ranpanfeng
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated: