Hadoop HDFS
HDFS-15367

Fail to get file checksum even if there's an available replica.


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.10.0
    • Fix Version/s: None
    • Component/s: dfsclient, namenode
    • Labels: None

    Description

      DFSClient can fail to get a file's checksum even when an available replica exists. One possible sequence of events that triggers the bug is as follows:

      • Start a cluster with three DNs (DN1, DN2, DN3). The default replication factor is set to 2.
      • Both DN1 and DN3 register with NN, as can be seen from NN's log (DN1 uses port 9866 while DN3 uses port 9666):
      2020-05-21 01:24:57,196 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/127.0.0.1:9866
      2020-05-21 01:25:06,155 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/127.0.0.1:9666
      • DN1 sends block report to NN, as can be seen from NN's log:
      2020-05-21 01:24:57,336 INFO BlockStateChange: BLOCK* processReport 0x3ae7e5805f2e704e: from storage DS-638ee5ae-e435-4d82-ae4f-9066bc7eb850 node DatanodeRegistration(127.0.0.1:9866, datanodeUuid=b0702574-968f-4817-a660-42ec1c475606, infoPort=9864, infoSecurePort=0, ipcPort=9867, storageInfo=lv=-57;cid=CID-75860997-47d0-4957-a4e6-4edbd79d64b8;nsid=49920454;c=1590024277030), blocks: 0, hasStaleStorage: false, processing time: 3 msecs, invalidatedBlocks: 0
      • DN3 fails to send its block report to NN because of a network partition; we inject the partition to fail DN3's blockReport RPC. Accordingly, NN's log contains no "processReport" entry for DN3.
      • DFSClient uploads a file. NN chooses DN1 and DN3 to host the replicas. The network partition on DN3 has ended by this point, so the upload succeeds, as NN's log confirms:
      2020-05-21 01:25:13,644 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741825_1001, replicas=127.0.0.1:9666, 127.0.0.1:9866 for /dir1/file1._COPYING_
      • Stop DN1, as can be seen from DN1's log:
      2020-05-21 01:25:21,114 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
      • DFSClient tries to get the file checksum. It fails to connect to DN1 and gives up, triggering the bug:
      20/05/21 01:25:34 INFO hdfs.DFSClient: Connecting to datanode 127.0.0.1:9866
      20/05/21 01:25:34 WARN hdfs.DFSClient: src=/dir1/file1, datanodes[0]=DatanodeInfoWithStorage[127.0.0.1:9866,DS-638ee5ae-e435-4d82-ae4f-9066bc7eb850,DISK]
      java.net.ConnectException: Connection refused
              at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
              at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
              at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
              at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
              at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
              at org.apache.hadoop.hdfs.DFSClient.connectToDN(DFSClient.java:1925)
              at org.apache.hadoop.hdfs.DFSClient.getFileChecksum(DFSClient.java:1798)
              at org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1638)
              at org.apache.hadoop.hdfs.DistributedFileSystem$33.doCall(DistributedFileSystem.java:1635)
              at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
              at org.apache.hadoop.hdfs.DistributedFileSystem.getFileChecksum(DistributedFileSystem.java:1646)
              at org.apache.hadoop.fs.shell.Display$Checksum.processPath(Display.java:199)
              at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:327)
              at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:299)
              at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:281)
              at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:265)
              at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:119)
              at org.apache.hadoop.fs.shell.Command.run(Command.java:175)
              at org.apache.hadoop.fs.FsShell.run(FsShell.java:317)
              at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
              at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
              at org.apache.hadoop.fs.FsShell.main(FsShell.java:380)
      checksum: Fail to get block MD5 for BP-2092781073-172.17.0.4-1590024277030:blk_1073741825_1001

      Since DN3 also has a replica of the file, DFSClient should try to contact DN3 to get the checksum.
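      A fix would iterate over every replica location returned by the NameNode instead of aborting after the first connection failure. Below is a minimal, self-contained sketch of that fallback pattern; the `Replica` interface and all names in it are hypothetical illustrations, not actual DFSClient code:

```java
import java.io.IOException;
import java.util.List;

public class ChecksumFallback {
    /** Hypothetical stand-in for one replica location of a block. */
    interface Replica {
        String address();
        byte[] readChecksum() throws IOException;
    }

    /**
     * Try each replica in turn instead of giving up after the first
     * connection failure, which is the behavior this issue asks for.
     */
    static byte[] checksumFromAnyReplica(List<Replica> replicas) throws IOException {
        IOException last = null;
        for (Replica r : replicas) {
            try {
                return r.readChecksum();
            } catch (IOException e) {
                last = e; // remember the failure, fall through to the next replica
            }
        }
        throw new IOException("all replicas failed", last);
    }

    public static void main(String[] args) throws IOException {
        // Simulates stopped DN1: every connection attempt is refused.
        Replica down = new Replica() {
            public String address() { return "127.0.0.1:9866"; }
            public byte[] readChecksum() throws IOException {
                throw new IOException("Connection refused");
            }
        };
        // Simulates live DN3, which still holds a replica.
        Replica up = new Replica() {
            public String address() { return "127.0.0.1:9666"; }
            public byte[] readChecksum() { return new byte[]{0x2a}; }
        };
        byte[] sum = checksumFromAnyReplica(List.of(down, up));
        System.out.println(sum[0]); // prints 42: the checksum came from the second replica
    }
}
```

      With this pattern, the client in the scenario above would fall back to DN3 after DN1 refuses the connection, rather than failing the whole checksum request.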

      To verify that DFSClient never connected to DN3, we changed the DEBUG log statement in DFSClient.connectToDN() to INFO. The error messages above show that DFSClient only tries to connect to DN1.
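      (Instead of modifying the code, the same connection attempts should be visible by raising the DFSClient logger to DEBUG in log4j.properties, assuming the stock Hadoop log4j setup:

```properties
log4j.logger.org.apache.hadoop.hdfs.DFSClient=DEBUG
```

      Either way, only the connection attempt to DN1 appears.)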


People

    Assignee: Unassigned
    Reporter: YCozy
