  1. Hadoop HDFS
  2. HDFS-17093

Fix block report lease issue to avoid missing some storage reports.


Details

    • Reviewed

    Description

      In our cluster of 800+ nodes, we found that after a namenode restart some datanodes did not report all of their blocks. Because block reporting was incomplete, the namenode stayed in safe mode for a long time after the restart.
      In the logs of a datanode with incomplete block reporting, I found that the first full block report (FBR) attempt failed, possibly because the namenode was under heavy load, and a second FBR attempt was then made:

      ....
      2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage report(s), of which we sent 1. The reports had 1099057 total blocks and used 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN processing. Got back no commands.
      2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Successfully sent block report 0x62382416f3f055,  containing 12 storage report(s), of which we sent 12. The reports had 1099048 total blocks and used 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN processing. Got back no commands. 

      The datanode side behaves correctly: it retries the report when the send fails. The problem is in the namenode-side logic:

      if (namesystem.isInStartupSafeMode()
          && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
          && storageInfo.getBlockReportCount() > 0) {
        blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
            + "discarded non-initial block report from {}"
            + " because namenode still in startup phase",
            strBlockReportId, fullBrLeaseId, nodeID);
        blockReportLeaseManager.removeLease(node);
        return !node.hasStaleStorages();
      } 

      When the namenode sees that a storage has already reported, i.e. storageInfo.getBlockReportCount() > 0, it removes the datanode's lease. As a result, the second report fails because the datanode no longer holds a lease.
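
      The interaction described above can be reproduced with a small toy model. This is only a sketch, not Hadoop code: LeaseRetrySketch, ToyNameNode, and simulate are hypothetical names, the model assumes the namenode is always in startup safe mode, and it assumes the namenode processed the first storage report even though the datanode saw that RPC fail. The keepLeaseUntilAllReported flag illustrates one conceivable mitigation (keep the lease while any storage is still unreported); it is not necessarily the change that was committed for this issue.

```java
/** Toy model of the namenode-side check quoted above. Not Hadoop code. */
public class LeaseRetrySketch {

    /** Hypothetical stand-in for the namenode's state for one datanode. */
    static final class ToyNameNode {
        private final int[] blockReportCount;          // per-storage report counter
        private final boolean keepLeaseUntilAllReported;
        private boolean leaseValid;

        ToyNameNode(int storages, boolean keepLeaseUntilAllReported) {
            this.blockReportCount = new int[storages];
            this.keepLeaseUntilAllReported = keepLeaseUntilAllReported;
        }

        void grantLease() {
            leaseValid = true;
        }

        /** Returns true if this storage report was processed. */
        boolean processReport(int storage) {
            if (!leaseValid) {
                return false;                          // rejected: no lease
            }
            // Assumption: the namenode is in startup safe mode throughout.
            if (blockReportCount[storage] > 0) {
                // Non-initial report: the quoted code always removes the lease here.
                if (!keepLeaseUntilAllReported
                        || reportedStorages() == blockReportCount.length) {
                    leaseValid = false;
                }
                return false;                          // discarded as non-initial
            }
            blockReportCount[storage]++;
            return true;
        }

        int reportedStorages() {
            int n = 0;
            for (int c : blockReportCount) {
                if (c > 0) {
                    n++;
                }
            }
            return n;
        }
    }

    /**
     * First attempt: only the first processedBeforeFailure storages reach the
     * namenode. Retry: the datanode gets a new lease and resends every storage.
     * Returns how many storages the namenode has a report for afterwards.
     */
    public static int simulate(int storages, int processedBeforeFailure,
                               boolean keepLeaseUntilAllReported) {
        ToyNameNode nn = new ToyNameNode(storages, keepLeaseUntilAllReported);
        nn.grantLease();
        for (int s = 0; s < processedBeforeFailure; s++) {
            nn.processReport(s);
        }
        nn.grantLease();
        for (int s = 0; s < storages; s++) {
            nn.processReport(s);
        }
        return nn.reportedStorages();
    }

    public static void main(String[] args) {
        // Mirrors the log excerpt: 12 storages, 1 sent in the failed attempt.
        System.out.println(simulate(12, 1, false) + " of 12 storages reported");
        System.out.println(simulate(12, 1, true) + " of 12 storages reported");
    }
}
```

      In this model the first call leaves only 1 of 12 storages reported, because the resent report for the first storage drops the lease and the remaining 11 are rejected. With the flag enabled, all 12 storages end up reported.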


            People

              yuyanlei Yanlei Yu
