Hadoop HDFS / HDFS-17157

Transient network failure in lease recovery could be mitigated to ensure better consistency


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.0-alpha, 3.3.6
    • Fix Version/s: None
    • Component/s: datanode

    Description

      This case is related to HDFS-12070.

      In HDFS-12070, we saw how a faulty drive on one datanode could lead to permanent block recovery failure and leave the file open indefinitely. The patch changed the behavior so that, instead of failing the whole lease recovery when the second stage of block recovery fails on a single datanode, lease recovery fails only when that stage fails on all of the datanodes.

      Below is the code snippet for the second stage of block recovery, in BlockRecoveryWorker#syncBlock:

      ...
      // Second stage: ask each participating datanode to finalize its replica
      // under recovery; a failure on one datanode only excludes it from
      // successList rather than failing the whole recovery.
      final List<BlockRecord> successList = new ArrayList<>();
      for (BlockRecord r : participatingList) {
        try {
          r.updateReplicaUnderRecovery(bpid, recoveryId, blockId,
              newBlock.getNumBytes());
          successList.add(r);
        } catch (IOException e) {
      ...

      However, under a transient network failure, the updateReplicaUnderRecovery RPC initiated by the primary datanode can fail with an EOFException while the target datanode either never processes the RPC at all or hits an IOException when reading from the socket, as in the following stack trace:

              at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
              at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
              at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
              at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
              at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:824)
              at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:788)
              at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1495)
              at org.apache.hadoop.ipc.Client.call(Client.java:1437)
              at org.apache.hadoop.ipc.Client.call(Client.java:1347)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
              at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
              at com.sun.proxy.$Proxy29.updateReplicaUnderRecovery(Unknown Source)
              at org.apache.hadoop.hdfs.protocolPB.InterDatanodeProtocolTranslatorPB.updateReplicaUnderRecovery(InterDatanodeProtocolTranslatorPB.java:112)
              at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.updateReplicaUnderRecovery(BlockRecoveryWorker.java:88)
              at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$BlockRecord.access$700(BlockRecoveryWorker.java:71)
              at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.syncBlock(BlockRecoveryWorker.java:300)
              at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$RecoveryTaskContiguous.recover(BlockRecoveryWorker.java:188)
              at org.apache.hadoop.hdfs.server.datanode.BlockRecoveryWorker$1.run(BlockRecoveryWorker.java:606)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: java.io.EOFException
              at java.io.DataInputStream.readInt(DataInputStream.java:392)
              at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1796)
              at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1165)
              at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1061) 

      Then, if the second stage of block recovery succeeds on any other datanode, the lease recovery succeeds and the file is closed. However, the last block is never synced to the datanode where the RPC failed, and this inconsistency could persist for a very long time.

      To fix the issue, I propose adding a configurable retry of the updateReplicaUnderRecovery RPC so that transient network failures can be mitigated.
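      As a rough sketch of the idea (the helper class, its names, and the configuration key it assumes are all hypothetical, not existing Hadoop code):

      import java.io.IOException;

      /**
       * Illustrative retry helper. A real patch would read maxRetries from a
       * new, configurable property (e.g. "dfs.datanode.block.recovery.rpc.retries",
       * a hypothetical key) and wrap the updateReplicaUnderRecovery call in
       * BlockRecoveryWorker#syncBlock with it.
       */
      final class UpdateReplicaRetry {

        /** An RPC invocation that may throw IOException. */
        @FunctionalInterface
        interface RpcCall {
          void invoke() throws IOException;
        }

        /**
         * Invokes the call, retrying on IOException (e.g. a transient
         * EOFException) up to maxRetries additional attempts; assumes
         * maxRetries >= 0. Rethrows the last failure so the caller keeps
         * today's behavior: drop the datanode from successList, and fail
         * lease recovery only if successList ends up empty.
         */
        static void invokeWithRetries(RpcCall call, int maxRetries)
            throws IOException {
          IOException lastFailure = null;
          for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
              call.invoke();
              return; // success; stop retrying
            } catch (IOException e) {
              lastFailure = e; // possibly transient; retry
            }
          }
          throw lastFailure; // retries exhausted
        }
      }

      The call in syncBlock would then become something like:

        UpdateReplicaRetry.invokeWithRetries(
            () -> r.updateReplicaUnderRecovery(bpid, recoveryId, blockId,
                newBlock.getNumBytes()),
            maxRetries);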

      Any comments and suggestions would be appreciated.


            People

              Assignee: Unassigned
              Reporter: Haoze Wu (functioner)
              Votes: 0
              Watchers: 3
