Hadoop HDFS / HDFS-17488

DN can fail IBRs with NPE when a volume is removed


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: hdfs

    Description

       

      Error logs

      2024-04-22 15:46:33,422 [BP-1842952724-10.22.68.249-1713771988830 heartbeating to localhost/127.0.0.1:64977] ERROR datanode.DataNode (BPServiceActor.java:run(922)) - Exception in BPOfferService for Block pool BP-1842952724-10.22.68.249-1713771988830 (Datanode Uuid 1659ffaf-1a80-4a8e-a542-643f6bd97ed4) service to localhost/127.0.0.1:64977
      java.lang.NullPointerException
          at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:246)
          at org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager.sendIBRs(IncrementalBlockReportManager.java:218)
          at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:749)
          at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920)
          at java.lang.Thread.run(Thread.java:748) 

      The root cause is in BPOfferService#notifyNamenodeBlock, which can be called for a block belonging to a volume that was already removed. Because the volume is gone, the storage lookup below returns null:

       

      private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status,
          String delHint, String storageUuid, boolean isOnTransientStorage) {
        checkBlock(block);
        final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo(
            block.getLocalBlock(), status, delHint);
        final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid);
        
        // storage == null here because it's already removed earlier.
      
        for (BPServiceActor actor : bpServices) {
          actor.getIbrManager().notifyNamenodeBlock(info, storage,
              isOnTransientStorage);
        }
      } 

      As a result, IBRs carrying a null storage are queued, and the next blockReceivedAndDeleted call in IncrementalBlockReportManager#sendIBRs fails with the NPE shown above.
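
      One possible mitigation (only a sketch based on the snippet above, not a committed fix) would be to skip queueing the IBR and log a warning when the storage lookup comes back null:

      private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status,
          String delHint, String storageUuid, boolean isOnTransientStorage) {
        checkBlock(block);
        final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo(
            block.getLocalBlock(), status, delHint);
        final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid);
        if (storage == null) {
          // The volume backing this storage was already removed; queueing the
          // IBR would only make a later blockReceivedAndDeleted call fail with
          // an NPE, so drop it and warn instead.
          LOG.warn("Skipping IBR for block {}: storage {} no longer exists",
              block, storageUuid);
          return;
        }
        for (BPServiceActor actor : bpServices) {
          actor.getIbrManager().notifyNamenodeBlock(info, storage,
              isOnTransientStorage);
        }
      }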

      The reason notifyNamenodeBlock can be triggered for such blocks lies further up, in DirectoryScanner#reconcile:

        public void reconcile() throws IOException {
          LOG.debug("reconcile start DirectoryScanning");
          scan();
      
          // If a volume is removed here after scan() already finished running,
          // diffs is stale and checkAndUpdate will run on a removed volume
      
          // HDFS-14476: run checkAndUpdate with batch to avoid holding the lock too
          // long
          int loopCount = 0;
          synchronized (diffs) {
            for (final Map.Entry<String, ScanInfo> entry : diffs.getEntries()) {
              dataset.checkAndUpdate(entry.getKey(), entry.getValue());        
              ...
        } 

      Inside checkAndUpdate, memBlockInfo is null because all in-memory block metadata was removed along with the volume, but diskFile still exists. DataNode#notifyNamenodeDeletedBlock (and, further down the line, notifyNamenodeBlock) is then called for this block.
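
      For clarity, here is a simplified sketch of the chain described above. This is not the actual FsDatasetImpl#checkAndUpdate code; the lookups and accessor names are assumptions used only to show the shape of the path:

      void checkAndUpdate(String bpid, ScanInfo info) {
        // The volume removal has already wiped this replica from the in-memory
        // volumeMap, so the lookup returns null...
        ReplicaInfo memBlockInfo = volumeMap.get(bpid, info.getBlockId());
        // ...while the block file of the removed (but not wiped) volume is
        // still present on disk.
        if (memBlockInfo == null && info.getBlockFile() != null) {
          // The scanner concludes the block was deleted and reports it, using
          // the storage uuid of the already-removed volume.
          datanode.notifyNamenodeDeletedBlock(
              new ExtendedBlock(bpid, info.getBlockId()),
              info.getVolume().getStorageID());
          // BPOfferService#notifyNamenodeBlock then resolves that uuid to
          // storage == null, producing the bad pending IBR.
        }
      }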

       

      People

        Assignee: Felix N (coconut_icecream)
        Reporter: Felix N (coconut_icecream)
