Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
Description
Error logs
2024-04-22 15:46:33,422 [BP-1842952724-10.22.68.249-1713771988830 heartbeating to localhost/127.0.0.1:64977] ERROR datanode.DataNode (BPServiceActor.java:run(922)) - Exception in BPOfferService for Block pool BP-1842952724-10.22.68.249-1713771988830 (Datanode Uuid 1659ffaf-1a80-4a8e-a542-643f6bd97ed4) service to localhost/127.0.0.1:64977
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReceivedAndDeleted(DatanodeProtocolClientSideTranslatorPB.java:246)
    at org.apache.hadoop.hdfs.server.datanode.IncrementalBlockReportManager.sendIBRs(IncrementalBlockReportManager.java:218)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:749)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:920)
    at java.lang.Thread.run(Thread.java:748)
The root cause is in BPOfferService#notifyNamenodeBlock: it is called for a block that belongs to a volume which has already been removed. Because the volume is gone, dn.getFSDataset().getStorage(storageUuid) returns null:
private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status,
    String delHint, String storageUuid, boolean isOnTransientStorage) {
  checkBlock(block);
  final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo(
      block.getLocalBlock(), status, delHint);
  final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid);
  // storage == null here because it's already removed earlier.

  for (BPServiceActor actor : bpServices) {
    actor.getIbrManager().notifyNamenodeBlock(info, storage,
        isOnTransientStorage);
  }
}
So IBRs with a null storage are now pending, and the NullPointerException shown above is thrown later, when IncrementalBlockReportManager#sendIBRs tries to send them.
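To make the failure mode concrete, below is a minimal standalone sketch (hypothetical class and method names, not Hadoop's actual code) of the same pattern: the null storage is queued silently at notification time, and the NullPointerException only surfaces later when the pending reports are sent.

// Standalone sketch of the "null queued now, NPE later" pattern described
// above. All names here are illustrative, not Hadoop classes.
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class DeferredNpeSketch {
  record Storage(String uuid) {}
  record QueuedReport(String blockId, Storage storage) {}

  static final Map<String, Storage> storages = new HashMap<>();
  static final Queue<QueuedReport> pendingIbrs = new ArrayDeque<>();

  static void notifyBlock(String blockId, String storageUuid) {
    // The lookup does not fail here even though the volume is gone; it just
    // hands back null, which is queued as-is (analogous to getStorage()).
    Storage storage = storages.get(storageUuid);
    pendingIbrs.add(new QueuedReport(blockId, storage));
  }

  static void sendIbrs() {
    for (QueuedReport r : pendingIbrs) {
      // Analogous to sending the IBR: dereferencing the null storage queued
      // above throws NullPointerException here, far from the actual cause.
      System.out.println("reporting " + r.blockId() + " on " + r.storage().uuid());
    }
  }

  public static void main(String[] args) {
    storages.put("DS-1", new Storage("DS-1"));
    storages.remove("DS-1");          // volume removed
    notifyBlock("blk_1001", "DS-1");  // enqueues a report with null storage
    sendIbrs();                       // NPE surfaces only at send time
  }
}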
The reason notifyNamenodeBlock can be triggered for such blocks lies further up, in DirectoryScanner#reconcile:
public void reconcile() throws IOException {
  LOG.debug("reconcile start DirectoryScanning");
  scan();

  // If a volume is removed here after scan() already finished running,
  // diffs is stale and checkAndUpdate will run on a removed volume

  // HDFS-14476: run checkAndUpdate with batch to avoid holding the lock too
  // long
  int loopCount = 0;
  synchronized (diffs) {
    for (final Map.Entry<String, ScanInfo> entry : diffs.getEntries()) {
      dataset.checkAndUpdate(entry.getKey(), entry.getValue());
  ...
}
Inside checkAndUpdate, memBlockInfo is null because all of the block's in-memory metadata was removed along with the volume, but diskFile still exists on disk. checkAndUpdate therefore calls DataNode#notifyNamenodeDeletedBlock (and, further down the line, notifyNamenodeBlock) for this block.
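The sketch below illustrates the race in isolation (again with hypothetical names, not Hadoop's code): the scanner snapshots its diffs first, the volume is removed afterwards, and reconciliation then acts on stale entries whose in-memory metadata is gone while the on-disk file still exists.

// Standalone sketch of the stale-diffs race between scan() and reconcile().
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StaleDiffSketch {
  // In-memory block map; entries are dropped when their volume is removed.
  static final Map<String, String> memBlockInfo = new HashMap<>();
  // Diff entries collected by the scan; never re-validated afterwards.
  static final List<String> diffs = new ArrayList<>();

  static void scan() {
    // scan() sees the block while its volume is still present and records it.
    diffs.add("blk_1001");
  }

  static void removeVolume() {
    // Volume removal drops the in-memory metadata, but the stale diffs list
    // (and the replica file on disk) are untouched.
    memBlockInfo.remove("blk_1001");
  }

  static void checkAndUpdate(String blockId) {
    String mem = memBlockInfo.get(blockId);
    boolean diskFileExists = true; // the replica file is still on disk
    if (mem == null && diskFileExists) {
      // Mirrors the path that ends in notifyNamenodeDeletedBlock(): a
      // notification is issued for a block whose volume no longer exists.
      System.out.println("notifying NameNode about " + blockId
          + " from a removed volume");
    }
  }

  public static void main(String[] args) {
    memBlockInfo.put("blk_1001", "volume-1");
    scan();          // diffs snapshotted while the volume is still present
    removeVolume();  // volume removed between scan() and reconciliation
    for (String blockId : diffs) {
      checkAndUpdate(blockId); // operates on the stale snapshot
    }
  }
}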
Attachments
Issue Links