[SPARK-44543] Cleanup .spark-staging directories when YARN application fails


Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.4.1
    • Fix Version/s: None
    • Component/s: Spark Core, Spark Shell
    • Labels: None

    Description

      Spark creates staging directories such as .hive-staging and .spark-staging when you run a dynamic insert overwrite into a partitioned table. Spark spends most of its time renaming the partitioned files, and because renames on GCS are slow, there are frequent scenarios where the YARN application fails due to network errors, etc.

      Such directories remain in Google Cloud Storage forever if the YARN ApplicationMaster is killed.

      Over time these pile up and incur significant cloud storage cost.
       
      Can we update the file committer to clean up the temporary staging directories when the job commit fails?

      PS: This request is specifically for GCS.
      See the attached image for reference.
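
      Until the committer cleans up after a failed commit, leftover staging directories can be swept by a periodic job. A minimal sketch, with hypothetical names and the local filesystem standing in for GCS (a real job would use the google-cloud-storage client or a bucket lifecycle rule rather than shutil):

      ```python
      import os
      import shutil
      import time

      # Directory-name prefixes Spark/Hive use for temporary staging output.
      STAGING_PREFIXES = (".spark-staging", ".hive-staging")

      def cleanup_stale_staging_dirs(root, max_age_seconds):
          """Delete staging directories under `root` whose mtime is older
          than `max_age_seconds`; return the paths that were removed."""
          removed = []
          for dirpath, dirnames, _ in os.walk(root):
              for name in list(dirnames):
                  if name.startswith(STAGING_PREFIXES):
                      full = os.path.join(dirpath, name)
                      age = time.time() - os.path.getmtime(full)
                      if age > max_age_seconds:
                          shutil.rmtree(full)   # drop the leftover staging dir
                          removed.append(full)
                      dirnames.remove(name)     # never descend into staging dirs
          return removed
      ```

      The age threshold guards against deleting staging directories that belong to a job which is still running.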
       

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Dipayan Dev (dipayandev)
            Votes: 0
            Watchers: 1

            Dates

              Created:
              Updated: