Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.10.2
None
None
Description
When I performed a full-stop upgrade from 2.10.2 to 3.3.6, I noticed the following error message:
2023-08-17 10:43:11,665 ERROR org.apache.hadoop.hdfs.server.common.Storage: Error reported on storage directory Storage Directory /tmp/hadoop-root/dfs/namesecondary
2023-08-19 05:21:41,544 INFO org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf: Image file /tmp/hadoop-root/dfs/namesecondary/current/fsimage.ckpt_0000000000000000188 of size 2881 bytes saved in 0 seconds .
2023-08-19 05:21:41,646 ERROR org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: RECEIVED SIGNAL 15: SIGTERM
2023-08-19 05:21:41,649 INFO org.apache.hadoop.hdfs.server.namenode.FSImage: FSImageSaver clean checkpoint: txid = 188 when meet shutdown.
2023-08-19 05:21:41,650 INFO org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down SecondaryNameNode at 555840e97c97/192.168.239.3
************************************************************/
2023-08-19 05:21:41,714 WARN org.apache.hadoop.hdfs.server.namenode.FSImage: Unable to rename checkpoint in Storage Directory /tmp/hadoop-root/dfs/namesecondary
java.io.IOException: renaming /tmp/hadoop-root/dfs/namesecondary/current/fsimage.ckpt_0000000000000000188 to /tmp/hadoop-root/dfs/namesecondary/current/fsimage_0000000000000000188 FAILED
    at org.apache.hadoop.hdfs.server.namenode.FSImage.renameImageFileInDir(FSImage.java:1329)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.renameCheckpoint(FSImage.java:1263)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1224)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.saveFSImageInAllDirs(FSImage.java:1172)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doMerge(SecondaryNameNode.java:1105)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doCheckpoint(SecondaryNameNode.java:563)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.doWork(SecondaryNameNode.java:360)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode$1.run(SecondaryNameNode.java:325)
    at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:481)
    at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.run(SecondaryNameNode.java:321)
    at java.lang.Thread.run(Thread.java:750)
2023-08-19 05:21:41,716 ERROR org.apache.hadoop.hdfs.server.common.Storage: Error reported on storage directory Storage Directory /tmp/hadoop-root/dfs/namesecondary
2023-08-19 05:21:41,716 WARN org.apache.hadoop.hdfs.server.common.Storage: About to remove corresponding storage: /tmp/hadoop-root/dfs/namesecondary
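To make the timing in the log easier to see, here is a minimal Python sketch that computes the gap between consecutive events. The timestamps are copied from the log above; the short event labels and the helper names (`EVENTS`, `parse`, `gaps_ms`) are my own and are only for illustration.

```python
from datetime import datetime, timedelta

# Timestamps copied from the log excerpt; labels are informal summaries.
EVENTS = [
    ("2023-08-19 05:21:41,544", "fsimage.ckpt saved"),
    ("2023-08-19 05:21:41,646", "SIGTERM received"),
    ("2023-08-19 05:21:41,649", "FSImageSaver clean checkpoint"),
    ("2023-08-19 05:21:41,714", "rename of checkpoint FAILED"),
]


def parse(ts):
    """Parse a log4j-style timestamp such as '2023-08-19 05:21:41,544'."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


def gaps_ms(events):
    """Return (label, milliseconds since the previous event) pairs."""
    out = []
    prev = None
    for ts, label in events:
        t = parse(ts)
        delta = 0 if prev is None else (t - prev) // timedelta(milliseconds=1)
        out.append((label, delta))
        prev = t
    return out


if __name__ == "__main__":
    for label, delta in gaps_ms(EVENTS):
        print(f"+{delta:4d} ms  {label}")
```

By this calculation the SIGTERM arrives 102 ms after the checkpoint image is saved, and the rename failure is logged 65 ms after the cleanup message, so the whole sequence fits in under 200 ms.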
The cluster I am using has four nodes: 1 NN, 1 SNN, 2 DN. The shutdown order for the upgrade is: (1) Stop SNN (2) Stop NN (3) Stop DN1 and DN2. The error message appears on the SNN while it is stopping.
The command sequence I was executing and the configurations are appended. I tried to reproduce the error by repeating the same command sequence plus the upgrade two thousand times, but it did not occur again. It might require some specific timing. I am not sure whether this could affect data integrity.
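One possible reading of the log, which I have not confirmed against the Hadoop source, is that the shutdown cleanup removed the `fsimage.ckpt_*` file before the checkpoint thread tried to rename it. The Python sketch below only illustrates that hypothesized interleaving on a local temporary directory. The function `save_then_rename` and its `cleanup_first` flag are hypothetical, and `os.rename` merely stands in for whatever rename the SecondaryNameNode performs.

```python
import os
import tempfile


def save_then_rename(storage_dir, txid, cleanup_first):
    """Write a checkpoint temp file, then rename it to its final name.

    If cleanup_first is True, a simulated shutdown cleanup deletes the
    temp file before the rename (the hypothesized interleaving).
    Returns True if the rename succeeded, False if the source was gone.
    """
    ckpt = os.path.join(storage_dir, "fsimage.ckpt_%019d" % txid)
    final = os.path.join(storage_dir, "fsimage_%019d" % txid)

    # Step 1: the checkpoint image is written under a temporary name.
    with open(ckpt, "wb") as f:
        f.write(b"image")

    # Step 2 (assumed race): shutdown cleanup removes the temp file.
    if cleanup_first:
        os.remove(ckpt)

    # Step 3: the checkpoint thread renames the temp file into place.
    try:
        os.rename(ckpt, final)
        return True
    except FileNotFoundError:
        return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print("normal order:", save_then_rename(d, 188, cleanup_first=False))
    with tempfile.TemporaryDirectory() as d:
        print("cleanup first:", save_then_rename(d, 188, cleanup_first=True))
```

If this reading is right, the failure needs the SIGTERM to land in the short window between saving the image and renaming it, which would be consistent with it not showing up in two thousand repeated runs.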
== Command Sequence ==
// Start up cluster (2.10.2), 4 nodes
bin/hdfs dfsadmin -safemode enter
bin/hdfs dfsadmin -rollingUpgrade prepare
bin/hdfs dfsadmin -safemode leave
// Execute commands
dfs -mkdir /fHPXyTkv
dfs -put -f -p /tmp/XPkJEWYY/kPCH /fHPXyTkv/
dfs -put -p -d /tmp/XPkJEWYY/HdM /fHPXyTkv/kPCH/xoflDHK/lJ
dfsadmin -report -live -decommissioning
dfsadmin -setSpaceQuota 1 -storageType ARCHIVE /fHPXyTkv/kPCH/xoflDHK/Ykc/AP
dfs -mkdir /fHPXyTkv/kPCH/xoflDHK/lJ/ozidF
dfs -mv /fHPXyTkv/kPCH/xoflDHK/Ykc /fHPXyTkv/kPCH/xoflDHK/lJ
dfs -mv /fHPXyTkv/kPCH/xoflDHK/lJ/AP /fHPXyTkv/kPCH/xoflDHK/eaSvvJyzZT/lL
dfsadmin -report -dead -decommissioning -enteringmaintenance
dfsadmin -refreshNodes
dfs -mkdir /fHPXyTkv/kPCH/xoflDHK/lJ/ozidF/SpdyMzpNXmVEL
dfs -setacl -k -m acl /kPCH/xoflDHK/lJ/ozidF --set acl2 /kPCH/xoflDHK/eaSvvJyzZT/lL
dfsadmin -refreshNodes
dfsadmin -setSpaceQuota 85 -storageType PROVIDED /fHPXyTkv/kPCH/mduNyG
dfsadmin -saveNamespace
dfs -put -f -p -d /tmp/XPkJEWYY/kPCH /fHPXyTkv/kPCH
dfsadmin -saveNamespace
dfs -mv /fHPXyTkv/kPCH/mduNyG/VZc /fHPXyTkv/kPCH/xoflDHK/Ykc/AP
dfsadmin -setSpaceQuota 85 -storageType PROVIDED /fHPXyTkv/kPCH/xoflDHK/eaSvvJyzZT/lL
dfs -put -f -p -d /tmp/XPkJEWYY/kPCH /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc
dfsadmin -report -dead -enteringmaintenance -inmaintenance
dfsadmin -setSpaceQuota 1 -storageType SSD /fHPXyTkv/kPCH/xoflDHK/JgKqDE
dfs -put -f /tmp/XPkJEWYY/HdM /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/kPCH/mduNyG/VZc
dfsadmin -rollEdits
dfs -cat /fHPXyTkv/kPCH/kPCH/mduNyG/YPZ
dfs -ls -d -q -S -r /fHPXyTkv/kPCH
dfs -ls -d -q -t -S /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/kPCH/xoflDHK/Ykc/AP
dfs -cat /fHPXyTkv/kPCH/xoflDHK/lJ/HdM
dfs -cat -ignoreCrc /fHPXyTkv/kPCH/mduNyG/YPZ
dfs -cat -ignoreCrc /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/kPCH/mduNyG/YPZ
dfs -ls -C -h -q -r /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/AP
dfs -cat -ignoreCrc /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/eJBcmWE
dfs -count -h -v -t DISK /fHPXyTkv/kPCH/kPCH/xoflDHK
dfs -count -q -h -x -u /fHPXyTkv/kPCH/xoflDHK/lJ
dfs -count -q /fHPXyTkv/kPCH/xoflDHK
dfs -cat /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/eJBcmWE
dfs -ls -q -t /fHPXyTkv/kPCH/kPCH
dfs -cat /fHPXyTkv/kPCH/mduNyG/YPZ
dfs -cat -ignoreCrc /fHPXyTkv/kPCH/kPCH/xoflDHK/Ykc/kPCH/mduNyG/VZc/HdM
// stop SNN
// stop NN
// stop DN1&DN2