Uploaded image for project: 'HBase'
  1. HBase
  2. HBASE-28167

HMaster crashes due to NPE at AsyncFSWAL.closeWriter

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 2.4.17
    • None
    • master
    • None

    Description

      I am testing the upgrade process of HMaster, when starting up the new version HMaster 2.4.17, it crashed immediately with the following exception.

      2023-10-17 21:03:35,892 ERROR [master/hmaster:16000:becomeActiveMaster] master.HMaster: ***** ABORTING master hmaster,16000,1697576576301: Unhandled exception. Starting shutdown. *****
      org.apache.hadoop.hbase.FailedCloseWALAfterInitializedErrorException: Failed close after init wal failed.
              at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:167)
              at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:62)
              at org.apache.hadoop.hbase.wal.WALFactory.getWAL(WALFactory.java:295)
              at org.apache.hadoop.hbase.master.region.MasterRegion.createWAL(MasterRegion.java:200)
              at org.apache.hadoop.hbase.master.region.MasterRegion.bootstrap(MasterRegion.java:220)
              at org.apache.hadoop.hbase.master.region.MasterRegion.create(MasterRegion.java:348)
              at org.apache.hadoop.hbase.master.region.MasterRegionFactory.create(MasterRegionFactory.java:104)
              at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:855)
              at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2193)
              at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:528)
              at java.lang.Thread.run(Thread.java:750)
      Caused by: java.io.IOException: java.lang.NullPointerException
              at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:979)
              at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.close(AbstractFSWAL.java:1006)
              at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWAL(AbstractFSWALProvider.java:165)
              ... 10 more
      Caused by: java.lang.NullPointerException
              at java.util.concurrent.ConcurrentHashMap.putVal(ConcurrentHashMap.java:1011)
              at java.util.concurrent.ConcurrentHashMap.put(ConcurrentHashMap.java:1006)
              at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.closeWriter(AsyncFSWAL.java:743)
              at org.apache.hadoop.hbase.regionserver.wal.AsyncFSWAL.doShutdown(AsyncFSWAL.java:800)
              at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL$2.call(AbstractFSWAL.java:951)
              at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL$2.call(AbstractFSWAL.java:946)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              ... 1 more 

      Reproduce

      It happens non-deterministically, around 2 out of 1802 tests. It might require an exception when HMaster interacts with the HDFS cluster since I noticed the following warning before the NPE exception

      2023-10-17 21:03:35,857 WARN  [master/hmaster:16000:becomeActiveMaster] asyncfs.FanOutOneBlockAsyncDFSOutputHelper: create fan-out dfs output /hbase/MasterData/WALs/hmaster,16000,1697576576301/hmaster%2C16000%2C1697576576301.1697576615700 failed, retry = 0
      org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /hbase/MasterData/WALs/hmaster,16000,1697576576301/hmaster%2C16000%2C1697576576301.1697576615700 could only be replicated to 0 nodes instead of minReplication (=1).  There are 1 datanode(s) running and no node(s) are excluded in this operation.
              at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1832)
              at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:265)
              at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2586)
              at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:889)
              at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:517)
              at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:498)
              at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1038)
              at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1003)
       

      This remote excpetion might lead to a NULL writer and it further causes NPE to crash the HMaster.

      Root Cause

      When invoking inflightWALClosures.put(path.getName(), writer); the writer is still null, this leads to an NPE which is not handled by HMaster and causes the crash.

      hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AsyncFSWAL.java
      
        private void closeWriter(AsyncWriter writer, Path path) {
          inflightWALClosures.put(path.getName(), writer); // NPE
          closeExecutor.execute(() -> {
            try {
              writer.close();
            } catch (IOException e) {
              LOG.warn("close old writer failed", e);
            } finally {
              // call this even if the above close fails, as there is no other chance we can set closed to
              // true, it will not cause big problems.
              markClosedAndClean(path);
              inflightWALClosures.remove(path.getName());
            }
          });
        }

      I haven't figured out the exact reason why writer is NULL. Will keep digging the root cause. Any help is appreciated!

       

      I have attached the full logs from HMaster.

      Attachments

        1. persistent.tar.gz
          122 kB
          Ke Han
        2. hbase--master-a82083cf5d18.log
          48 kB
          Ke Han

        Activity

          People

            Unassigned Unassigned
            kehan5800 Ke Han
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: