Hadoop HDFS / HDFS-17157

Transient network failure in lease recovery could be mitigated to ensure better consistency


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0-alpha, 3.3.6
    • Fix Version/s: None
    • Component/s: datanode

    Description

      This case is related to HDFS-12070.

      In HDFS-12070, we saw how a faulty drive on one datanode could lead to permanent block recovery failure and leave the file open indefinitely. The patch changed the behavior so that, instead of failing the whole lease recovery when the second stage of block recovery fails on a single datanode, lease recovery fails only when that stage fails on all of the datanodes.

      Below is the code snippet for the second stage of block recovery, in BlockRecoveryWorker#syncBlock:

      ...
      // Second stage: ask each participating datanode to finalize its replica
      // under recovery; a failure on one datanode only excludes it from
      // successList rather than failing the whole recovery.
      final List<BlockRecord> successList = new ArrayList<>();
      for (BlockRecord r : participatingList) {
        try {
          r.updateReplicaUnderRecovery(bpid, recoveryId, blockId,
              newBlock.getNumBytes());
          successList.add(r);
        } catch (IOException e) {
      ...

      However, under a transient network failure, the updateReplicaUnderRecovery RPC initiated by the primary datanode can fail with an EOFException while the target datanode either never processes the RPC at all or hits an IOException when reading from the socket, as in the following stack trace:

              at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
              at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
              at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
              at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
              at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
              at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:788)
              at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1495)
              at org.apache.hadoop.ipc.Client.call(Client.java:1437)
              at org.apache.hadoop.ipc.Client.call(Client.java:1347)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
              at com.sun.proxy.$Proxy29.updateReplicaUnderRecovery(Unknown Source)
              at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.updateReplicaUnderRecovery(InterDatanodeProtocolTranslatorPB.java:112)
              at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.updateReplicaUnderRecovery(BlockRecoveryWorker.java:88)
              at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.access$700(BlockRecoveryWorker.java:71)
              at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:300)
              at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:188)
              at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:606)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: java.io.EOFException
              at java.io.DataInputStream.readInt(DataInputStream.java:392)
              at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1796)
              at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1165)
              at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1061) 

      Then, if the second stage of block recovery succeeds on any other datanode, the lease recovery succeeds and the file is closed. However, the last block is never synced to the datanode where the RPC failed, and this inconsistency could persist for a very long time.

      To fix the issue, I propose adding a configurable retry of the updateReplicaUnderRecovery RPC so that transient network failures can be mitigated.
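      As a rough sketch of the idea (the helper class, its names, and the configuration key it assumes are all hypothetical, not existing Hadoop code):

      import java.io.IOException;

      /**
       * Illustrative retry helper. A real patch would read maxRetries from a
       * new, configurable property (e.g. "dfs.datanode.block.recovery.rpc.retries",
       * a hypothetical key) and wrap the updateReplicaUnderRecovery call in
       * BlockRecoveryWorker#syncBlock with it.
       */
      final class UpdateReplicaRetry {

        /** An RPC invocation that may throw IOException. */
        @FunctionalInterface
        interface RpcCall {
          void invoke() throws IOException;
        }

        /**
         * Invokes the call, retrying on IOException (e.g. a transient
         * EOFException) up to maxRetries additional attempts; assumes
         * maxRetries >= 0. Rethrows the last failure so the caller keeps
         * today's behavior: drop the datanode from successList, and fail
         * lease recovery only if successList ends up empty.
         */
        static void invokeWithRetries(RpcCall call, int maxRetries)
            throws IOException {
          IOException lastFailure = null;
          for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
              call.invoke();
              return; // success; stop retrying
            } catch (IOException e) {
              lastFailure = e; // possibly transient; retry
            }
          }
          throw lastFailure; // retries exhausted
        }
      }

      The call in syncBlock would then become something like:

        UpdateReplicaRetry.invokeWithRetries(
            () -> r.updateReplicaUnderRecovery(bpid, recoveryId, blockId,
                newBlock.getNumBytes()),
            maxRetries);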

      Any comments and suggestions would be appreciated.


            People

              Assignee: Unassigned
              Reporter: Haoze Wu (functioner)
              Votes: 0
              Watchers: 3
