Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 1.3.0
    • Fix Version/s: None
    • Component/s: Compaction, regionserver
    • Labels: None

    Description

      We observed occasional (rare) failed splits on branch-1.3 builds, which might be another echo of HBASE-13082.

      Essentially here's what seems to be happening:

      First, the RS hosting the to-be-split parent sees some FileNotFoundExceptions in the logs. It could be a simple file-not-found on some scanner path:

      16/11/21 07:19:28 WARN hdfs.DFSClient: DFS Read
      java.io.FileNotFoundException: File does not exist: <path to HFile>
      at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
      at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)

      ....
      at org.apache.hadoop.hbase.io.hfile.HFileBlock.readWithExtra(HFileBlock.java:733)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$AbstractFSReader.readAtOffset(HFileBlock.java:1461)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1715)
      at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1560)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.readBlock(HFileReaderV2.java:454)
      at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.loadDataBlockWithScanInfo(HFileBlockIndex.java:271)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.seekTo(HFileReaderV2.java:651)
      at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$AbstractScannerV2.reseekTo(HFileReaderV2.java:631)
      at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseekAtOrAfter(StoreFileScanner.java:292)
      at org.apache.hadoop.hbase.regionserver.StoreFileScanner.reseek(StoreFileScanner.java:201)
      at org.apache.hadoop.hbase.regionserver.StoreFileScanner.enforceSeek(StoreFileScanner.java:412)
      at org.apache.hadoop.hbase.regionserver.StoreFileScanner.requestSeek(StoreFileScanner.java:375)
      at org.apache.hadoop.hbase.regionserver.KeyValueHeap.generalizedSeek(KeyValueHeap.java:310)
      at org.apache.hadoop.hbase.regionserver.KeyValueHeap.requestSeek(KeyValueHeap.java:268)
      at org.apache.hadoop.hbase.regionserver.StoreScanner.reseek(StoreScanner.java:889)
      at org.apache.hadoop.hbase.regionserver.StoreScanner.seekToNextRow(StoreScanner.java:867)
      at org.apache.hadoop.hbase.regionserver.Stor

      Or it could be a warning from HFileArchiver:

      16/11/21 07:20:44 WARN backup.HFileArchiver: Failed to archive class org.apache.hadoop.hbase.backup.HFileArchiver$FileableStoreFile, <HFile path> because it does not exist! Skipping and continuing on.
      java.io.FileNotFoundException: File/Directory <HFile Path> does not exist.
      at org.apache.hadoop.hdfs.server.namenode.FSDirAttrOp.setTimes(FSDirAttrOp.java:121)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setTimes(FSNamesystem.java:1910)
      at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.setTimes(NameNodeRpcServer.java:1223)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.setTimes(ClientNamenodeProtocolServerSideTranslatorPB.java:915)
      at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
      at or

      Then on the RS hosting the parent I'm seeing:

      16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all stores for the region=<region name>
      16/11/21 18:03:17 ERROR regionserver.HRegion: Could not initialize all stores for the <region name>
      16/11/21 18:03:17 WARN ipc.Client: interrupted waiting to send rpc request to server
      java.lang.InterruptedException
      at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404)
      at java.util.concurrent.FutureTask.get(FutureTask.java:191)
      at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1060)
      at org.apache.hadoop.ipc.Client.call(Client.java:1455)
      at org.apache.hadoop.ipc.Client.call(Client.java:1413)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
      at com.sun.proxy.$Proxy16.getFileInfo(Unknown Source)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:771)
      at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
      at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
      at com.sun.proxy.$Proxy17.getFileInfo(Unknown Source)
      at sun.reflect.GeneratedMethodAccessor83.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
      at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:302)
      at com.sun.proxy.$Proxy18.getFileInfo(Unknown Source)
      at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2112)
      at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1305)
      at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
      at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
      at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
      at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
      at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
      at org.apache.hadoop.hbase.regionserver.HRegionFileSystem.createStoreDir(HRegionFileSystem.java:171)
      at org.apache.hadoop.hbase.regionserver.HStore.<init>(HStore.java:224)
      at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:5185)
      at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:926)
      at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:923)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
      ...
      16/11/21 18:03:17 FATAL regionserver.HRegionServer: ABORTING region server <server name>: Abort; we got an error after point-of-no-return

      So we've got past the PONR (the split transaction's point of no return, after which any failure forces the RS to abort) and the server aborts; then on the RSes where the daughters are to be opened I'm seeing:

      16/11/21 18:03:43 ERROR handler.OpenRegionHandler: Failed open of region= <region name>, starting to roll back the global memstore size.
      java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: < HFile name>
      at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
      at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
      at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
      at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:588)
      at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:365)
      at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
      at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
      at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
      at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
      at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2045)

      at org.apache.hadoop.hbase.regionserver.HRegion.initializeStores(HRegion.java:952)
      at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:827)
      at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:802)
      at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6708)
      at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6669)
      at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6640)
      at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6596)
      at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:6547)
      at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:362)
      at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:129)
      at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)

      And the regions remain offline. There's no data loss here, as the daughters never open up, and the failed split can be recovered manually using the following procedure (a command sketch follows the list):

      • manually remove daughters from hbase:meta
      • move daughter region HDFS directories out of the way
      • delete the parent region from hbase:meta
      • hbck -fixMeta to add the parent region back
      • failover the active master
      • hbck -fixAssignments after master startup
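
      A minimal command-line sketch of that procedure, assuming hbck from the same 1.x release; the region names, table path, and /tmp staging directory below are placeholders, not values from this incident:

      # 1. manually remove the daughter rows from hbase:meta
      echo "deleteall 'hbase:meta', '<daughter 1 region name>'" | hbase shell
      echo "deleteall 'hbase:meta', '<daughter 2 region name>'" | hbase shell

      # 2. move the daughter region HDFS directories out of the way
      hdfs dfs -mv /hbase/data/<namespace>/<table>/<daughter 1 encoded name> /tmp/split-recovery/
      hdfs dfs -mv /hbase/data/<namespace>/<table>/<daughter 2 encoded name> /tmp/split-recovery/

      # 3. delete the parent region's row from hbase:meta
      echo "deleteall 'hbase:meta', '<parent region name>'" | hbase shell

      # 4. re-add the parent region to hbase:meta from its HDFS regioninfo
      hbase hbck -fixMeta

      # 5. fail over the active master, then
      # 6. repair assignments once the new master is up
      hbase hbck -fixAssignments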

    People

      Assignee: Unassigned
      Reporter: Mikhail Antonov (mantonov)
      Votes: 0
      Watchers: 15
