Spark / SPARK-25103

CompletionIterator may delay GC of completed resources


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.0.1, 2.1.0, 2.2.0, 2.3.0
    • Fix Version/s: None
    • Component/s: Spark Core

    Description

      While working on SPARK-22713, I found (and partially fixed) a scenario in which an already-exhausted iterator still holds a reference to resources that could be GCed at that point.

      However, these resources cannot be GCed because of that lingering reference.

      The specific fix applied in SPARK-22713 was to wrap the iterator in a CompletionIterator that cleans it up when exhausted. The catch is that it is quite easy to get this wrong by closing over local variables or the `this` reference in the cleanup function itself.
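      As a hypothetical illustration of that pitfall (SpillReader and wrapWithCleanup are made-up names for this sketch, not Spark code):

      ```scala
      // Hypothetical sketch of the pitfall: the cleanup closure captures the
      // whole `reader`, so its buffer stays reachable for as long as the
      // closure itself is reachable, even after the iterator is exhausted.
      class SpillReader(val buffer: Array[Byte]) {
        def rows: Iterator[Int] = Iterator.range(0, buffer.length)
        def close(): Unit = ()
      }

      def wrapWithCleanup(reader: SpillReader): (Iterator[Int], () => Unit) = {
        // BAD: `() => reader.close()` closes over `reader`; whoever holds
        // this cleanup function transitively keeps `buffer` alive.
        (reader.rows, () => reader.close())
      }
      ```

      The same capture happens silently when the cleanup body mentions a field, because the compiler then closes over `this`.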

      I propose solving this by modifying CompletionIterator to discard its references to the wrapped iterator and the cleanup function once exhausted.
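      A minimal sketch of that idea (an assumed, simplified API for illustration, not Spark's actual CompletionIterator implementation):

      ```scala
      // Sketch: a completion iterator that runs `completion` once the
      // sub-iterator is exhausted, then drops its references so both the
      // wrapped iterator and anything captured by the cleanup closure
      // become eligible for GC.
      class GcFriendlyCompletionIterator[A](
          sub: Iterator[A],
          completion: () => Unit) extends Iterator[A] {

        private[this] var iter: Iterator[A] = sub
        private[this] var cleanup: () => Unit = completion
        private[this] var completed = false

        override def hasNext: Boolean = {
          if (completed) return false
          val more = iter.hasNext
          if (!more) {
            completed = true
            cleanup()
            // Discard references: even if this wrapper stays reachable
            // (e.g. while a writer post-processes), the resources do not.
            iter = null
            cleanup = null
          }
          more
        }

        override def next(): A = {
          if (completed) throw new NoSuchElementException("iterator exhausted")
          iter.next()
        }
      }
      ```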

       

      • A dive into the code showed that most CompletionIterators are eventually consumed by
        org.apache.spark.scheduler.ShuffleMapTask#runTask

        which does:

      writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])

      Looking at the implementations of

      org.apache.spark.shuffle.ShuffleWriter#write

      it seems all of them first exhaust the iterator and then perform some kind of post-processing: merging spills, sorting, or writing partition files and then concatenating them into a single file. The bottom line is that the iterator may actually be 'sitting' for some time after being exhausted.

       

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Eyal Farago (eyalfa)
            Votes: 1
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: