Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.1.1
Fix Version/s: None
Component/s: None
Description
Assume an external table `emp` defined as follows:
create external table emp (id int, name string) partitioned by (dept string) location 'hdfs://namenode.com:8020/hive/data/db/emp';
Create, say, 1000 partition directories in HDFS (one way to do this is sketched below).
Now, to synchronize the metastore, run the MSCK command while the HDFS directories are being deleted in parallel; at some point MSCK fails with a FileNotFoundException. Here is the stack trace.
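For instance, the partition directories can be created directly through the HDFS FileSystem API, outside the metastore, which is exactly the situation MSCK is meant to reconcile. A minimal sketch, assuming the namenode URI and table location from the example above; the class name, partition values, and loop count are illustrative:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: creates dept=... partition directories directly in HDFS so that
// MSCK REPAIR TABLE has partitions to discover. Partition values are made up.
public class CreatePartitionDirs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.com:8020"), conf);
    Path tableDir = new Path("/hive/data/db/emp");
    for (int i = 0; i < 1000; i++) {
      // one directory per partition value, e.g. .../emp/dept=dept_0
      fs.mkdirs(new Path(tableDir, "dept=dept_" + i));
    }
    fs.close();
  }
}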
2019-12-10 23:21:50,027 WARN hive.ql.exec.DDLTask: [HiveServer2-Background-Pool: Thread-500224]: Failed to run metacheck: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.FileNotFoundException: File hdfs://namenode.com:8020/hive/data/db/emp/dept=CS does not exist.
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkPartitionDirs(HiveMetaStoreChecker.java:554) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkPartitionDirs(HiveMetaStoreChecker.java:443) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.findUnknownPartitions(HiveMetaStoreChecker.java:334) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:310) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:253) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkMetastore(HiveMetaStoreChecker.java:118) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.exec.DDLTask.msck(DDLTask.java:1862) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:413) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:97) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2200) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1843) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1563) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1339) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1334) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:256) [hive-service-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hive.service.cli.operation.SQLOperation.access$600(SQLOperation.java:92) [hive-service-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:345) [hive-service-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_121]
    at javax.security.auth.Subject.doAs(Subject.java:422) [?:1.8.0_121]
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875) [hadoop-common-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:357) [hive-service-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_121]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.io.FileNotFoundException: File hdfs://namenode.com:8020/hive/data/db/emp/dept=CS does not exist.
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:985) ~[hadoop-hdfs-client-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:121) ~[hadoop-hdfs-client-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1045) ~[hadoop-hdfs-client-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1042) ~[hadoop-hdfs-client-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) ~[hadoop-common-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1052) ~[hadoop-hdfs-client-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1853) ~[hadoop-common-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1895) ~[hadoop-common-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker$PathDepthInfoCallable.processPathDepthInfo(HiveMetaStoreChecker.java:474) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker$PathDepthInfoCallable.call(HiveMetaStoreChecker.java:467) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker$PathDepthInfoCallable.call(HiveMetaStoreChecker.java:448) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    ... 4 more
I analyzed the stack trace and found that the problem is in HiveMetaStoreChecker's processPathDepthInfo method [1].
What this code does is:
- Create a queue.
- Put the table's data directory in the queue.
- Start a few threads that list the directories in the queue and add newly discovered sub-directories back to the queue.
This process has a flaw. Say there are 1000 first-level directories and 1000*500 second-level directories; then there is necessarily a substantial gap between the moment a path is put in the queue and the moment its contents are listed. That window is large enough for an HDFS delete to remove the directory, and when that happens we get the failure above. The sketch below illustrates the pattern and the race window.
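A minimal sketch of that queue-and-threads directory walk, under the assumption that the class and method names are illustrative (this is not the actual HiveMetaStoreChecker code); the comment marks the window in which a concurrent delete makes listStatus() throw the FileNotFoundException seen above:

import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch of the producer/consumer directory walk described above.
public class PartitionDirWalkSketch {
  private final FileSystem fs;
  private final ExecutorService pool;
  private final AtomicInteger pending = new AtomicInteger();
  private final CountDownLatch done = new CountDownLatch(1);
  private final AtomicReference<IOException> failure = new AtomicReference<>();

  PartitionDirWalkSketch(FileSystem fs, int threads) {
    this.fs = fs;
    this.pool = Executors.newFixedThreadPool(threads);
  }

  void walk(Path tableDir) throws Exception {
    submit(tableDir);            // seed the work with the table's data directory
    done.await();                // wait until every queued directory has been listed
    pool.shutdown();
    if (failure.get() != null) {
      throw failure.get();       // current behaviour: one missing directory fails the whole check
    }
  }

  private void submit(final Path dir) {
    pending.incrementAndGet();
    pool.execute(new Runnable() {
      @Override
      public void run() {
        try {
          // Race window: 'dir' existed when it was queued, but with many
          // thousands of directories a long time can pass before this
          // listStatus() call runs; a concurrent HDFS delete in that window
          // makes it throw FileNotFoundException.
          for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
              submit(status.getPath());   // newly discovered sub-directory
            }
          }
        } catch (IOException e) {
          failure.compareAndSet(null, e);
        } finally {
          if (pending.decrementAndGet() == 0) {
            done.countDown();
          }
        }
      }
    });
  }
}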
What can be improved (see the sketch after this list):
- [best, in my opinion] Consume the exception and perhaps log it at DEBUG level.
- Check that the directory exists before listing its contents.
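A sketch of what the two options could look like at the listStatus() call site; the class, method, and logger names are illustrative assumptions, not the actual Hive patch:

import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch of the two proposed improvements.
public class MsckListingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(MsckListingSketch.class);

  // Option 1 (preferred): swallow the FileNotFoundException and log it at DEBUG,
  // treating a directory that disappeared mid-check as "no partitions to report".
  static FileStatus[] listPartitionDir(FileSystem fs, Path dir) throws IOException {
    try {
      return fs.listStatus(dir);
    } catch (FileNotFoundException e) {
      LOG.debug("Directory {} was deleted while MSCK was running; skipping it", dir, e);
      return new FileStatus[0];
    }
  }

  // Option 2: check existence before listing. Note this only narrows the race
  // window; the directory can still be deleted between exists() and listStatus().
  static FileStatus[] listPartitionDirWithCheck(FileSystem fs, Path dir) throws IOException {
    if (!fs.exists(dir)) {
      return new FileStatus[0];
    }
    return fs.listStatus(dir);
  }
}

Option 1 handles the race completely, since the exception is the only reliable signal that the directory vanished after it was queued; option 2 is simpler but still leaves a (smaller) window open.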
References: