Details
- Type: New Feature
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version: 3.4.1
- Fix Version: None
- Component: None
Description
Spark creates staging directories such as .hive-staging and .spark-staging when you run a dynamic insert overwrite into a partitioned table. Spark spends most of its time renaming the partitioned files, and because renames on GCS are slow, the YARN application frequently fails partway through (for example, due to a network error).
If the YARN application master is killed, these staging directories remain in Google Cloud Storage indefinitely.
Over time they pile up and incur significant cloud storage cost.
Can we update the file committer to clean up the temporary directories when the job commit fails?
PS: This request is specific to GCS.
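To illustrate the requested behavior, here is a minimal sketch of the cleanup step that could run when a job commit fails or when reconciling an output path afterwards. The function name `cleanup_staging_dirs` and the use of a local directory in place of a GCS bucket are illustrative assumptions; a real implementation would live in Spark's file commit protocol and use the Hadoop FileSystem API via the GCS connector.

```python
import os
import shutil
import tempfile

# Prefixes of the temporary directories Spark leaves behind
# when a dynamic insert overwrite is interrupted.
STAGING_PREFIXES = (".hive-staging", ".spark-staging")

def cleanup_staging_dirs(output_path: str) -> list:
    """Delete leftover staging directories under output_path.

    Hypothetical sketch: a local filesystem stands in for GCS.
    Returns the list of directories that were removed.
    """
    removed = []
    for name in os.listdir(output_path):
        full = os.path.join(output_path, name)
        if os.path.isdir(full) and name.startswith(STAGING_PREFIXES):
            shutil.rmtree(full)
            removed.append(full)
    return removed

# Example: simulate a failed commit that left a staging dir behind.
out = tempfile.mkdtemp()
os.makedirs(os.path.join(out, ".spark-staging-abc123"))  # orphaned temp dir
os.makedirs(os.path.join(out, "year=2024"))              # committed partition, must be kept
removed = cleanup_staging_dirs(out)
```

The key design point is that only directories matching the known staging prefixes are touched, so already-committed partition directories are never deleted.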
Image for reference