Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.1.1
Fix Version/s: None
Component/s: None
Description
Assume an external table `emp` defined as follows:
create external table emp (id int, name string) partitioned by (dept string) location 'hdfs://namenode.com:8020/hive/data/db/emp';
Create, say, 1000 partition directories in HDFS (one way to do this is sketched below).
Now, to synchronize the metastore, run the MSCK command while the HDFS directories are being deleted in parallel; at some point MSCK fails with a FileNotFoundException. Here is the stack trace.
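For instance, the partition directories can be created directly through the HDFS FileSystem API, outside the metastore, which is exactly the situation MSCK is meant to reconcile. A minimal sketch, assuming the namenode URI and table location from the example above; the class name, partition values, and loop count are illustrative:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: creates dept=... partition directories directly in HDFS so that
// MSCK REPAIR TABLE has partitions to discover. Partition values are made up.
public class CreatePartitionDirs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode.com:8020"), conf);
    Path tableDir = new Path("/hive/data/db/emp");
    for (int i = 0; i < 1000; i++) {
      // one directory per partition value, e.g. .../emp/dept=dept_0
      fs.mkdirs(new Path(tableDir, "dept=dept_" + i));
    }
    fs.close();
  }
}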
2019-12-10 23:21:50,027 WARN hive.ql.exec.DDLTask: [HiveServer2-Background-Pool: Thread-500224]: Failed to run metacheck: org.apache.hadoop.hive.ql.metadata.HiveException: java.io.FileNotFoundException: File hdfs://namenode.com:8020/hive/data/db/emp/dept=CS does not exist.
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkPartitionDirs(HiveMetaStoreChecker.java:554) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkPartitionDirs(HiveMetaStoreChecker.java:443) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.findUnknownPartitions(HiveMetaStoreChecker.java:334) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:310) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkTable(HiveMetaStoreChecker.java:253) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker.checkMetastore(HiveMetaStoreChecker.java:118) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.exec.DDLTask.msck(DDLTask.java:1862) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:413) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:199) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:97) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2200) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1843) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1563) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1339) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1334) [hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:256) [hive-service-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hive.service.cli.operation.SQLOperation.access$600(SQLOperation.java:92) [hive-service-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:345) [hive-service-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_121]
    at javax.security.auth.Subject.doAs(Subject.java:422) [?:1.8.0_121]
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1875) [hadoop-common-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:357) [hive-service-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_121]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.io.FileNotFoundException: File hdfs://namenode.com:8020/hive/data/db/emp/dept=CS does not exist.
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:985) ~[hadoop-hdfs-client-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:121) ~[hadoop-hdfs-client-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1045) ~[hadoop-hdfs-client-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1042) ~[hadoop-hdfs-client-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) ~[hadoop-common-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1052) ~[hadoop-hdfs-client-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1853) ~[hadoop-common-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1895) ~[hadoop-common-3.0.0-cdh6.2.1.jar:?]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker$PathDepthInfoCallable.processPathDepthInfo(HiveMetaStoreChecker.java:474) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker$PathDepthInfoCallable.call(HiveMetaStoreChecker.java:467) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    at org.apache.hadoop.hive.ql.metadata.HiveMetaStoreChecker$PathDepthInfoCallable.call(HiveMetaStoreChecker.java:448) ~[hive-exec-2.1.1-cdh6.2.1.jar:2.1.1-cdh6.2.1]
    ... 4 more
I analyzed the stack trace and found that the problem is in HiveMetaStoreChecker's processPathDepthInfo method [1].
What this code does is:
- Create a queue.
- Put the table's data directory in the queue.
- Start a few threads that list the directories in the queue and add newly discovered sub-directories back to the queue.
This process has a flaw. Say there are 1000 first-level directories and 1000*500 second-level directories; then there is necessarily a substantial gap between the moment a path is put in the queue and the moment its contents are listed. That window is large enough for an HDFS delete to remove the directory, and when that happens we get the failure above. The sketch below illustrates the pattern and the race window.
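A minimal sketch of that queue-and-threads directory walk, under the assumption that the class and method names are illustrative (this is not the actual HiveMetaStoreChecker code); the comment marks the window in which a concurrent delete makes listStatus() throw the FileNotFoundException seen above:

import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch of the producer/consumer directory walk described above.
public class PartitionDirWalkSketch {
  private final FileSystem fs;
  private final ExecutorService pool;
  private final AtomicInteger pending = new AtomicInteger();
  private final CountDownLatch done = new CountDownLatch(1);
  private final AtomicReference<IOException> failure = new AtomicReference<>();

  PartitionDirWalkSketch(FileSystem fs, int threads) {
    this.fs = fs;
    this.pool = Executors.newFixedThreadPool(threads);
  }

  void walk(Path tableDir) throws Exception {
    submit(tableDir);            // seed the work with the table's data directory
    done.await();                // wait until every queued directory has been listed
    pool.shutdown();
    if (failure.get() != null) {
      throw failure.get();       // current behaviour: one missing directory fails the whole check
    }
  }

  private void submit(final Path dir) {
    pending.incrementAndGet();
    pool.execute(new Runnable() {
      @Override
      public void run() {
        try {
          // Race window: 'dir' existed when it was queued, but with many
          // thousands of directories a long time can pass before this
          // listStatus() call runs; a concurrent HDFS delete in that window
          // makes it throw FileNotFoundException.
          for (FileStatus status : fs.listStatus(dir)) {
            if (status.isDirectory()) {
              submit(status.getPath());   // newly discovered sub-directory
            }
          }
        } catch (IOException e) {
          failure.compareAndSet(null, e);
        } finally {
          if (pending.decrementAndGet() == 0) {
            done.countDown();
          }
        }
      }
    });
  }
}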
What can be improved (see the sketch after this list):
- [best, in my opinion] Consume the exception and perhaps log it at DEBUG level.
- Check that the directory exists before listing its contents.
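A sketch of what the two options could look like at the listStatus() call site; the class, method, and logger names are illustrative assumptions, not the actual Hive patch:

import java.io.FileNotFoundException;
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch of the two proposed improvements.
public class MsckListingSketch {
  private static final Logger LOG = LoggerFactory.getLogger(MsckListingSketch.class);

  // Option 1 (preferred): swallow the FileNotFoundException and log it at DEBUG,
  // treating a directory that disappeared mid-check as "no partitions to report".
  static FileStatus[] listPartitionDir(FileSystem fs, Path dir) throws IOException {
    try {
      return fs.listStatus(dir);
    } catch (FileNotFoundException e) {
      LOG.debug("Directory {} was deleted while MSCK was running; skipping it", dir, e);
      return new FileStatus[0];
    }
  }

  // Option 2: check existence before listing. Note this only narrows the race
  // window; the directory can still be deleted between exists() and listStatus().
  static FileStatus[] listPartitionDirWithCheck(FileSystem fs, Path dir) throws IOException {
    if (!fs.exists(dir)) {
      return new FileStatus[0];
    }
    return fs.listStatus(dir);
  }
}

Option 1 handles the race completely, since the exception is the only reliable signal that the directory vanished after it was queued; option 2 is simpler but still leaves a (smaller) window open.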
References: