Spark / SPARK-16736

Remove redundant FileSystem status check calls from the Spark codebase

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.1.0
    • Component/s: Spark Core
    • Labels: None

    Description

      The Hadoop FileSystem.exists() and FileSystem.isDirectory() calls are wrappers around FileSystem.getFileStatus(). Every getFileStatus() call puts load on the HDFS NameNode and is very slow against object stores.

      1. If these calls are followed by a getFileStatus() call, they can be eliminated by merging the calls and moving the catch of FileNotFoundException out of exists() and into the Spark code.
      2. Any sequence of exists() + delete() can be optimised by removing the exists() check and relying on FileSystem.delete() being a no-op when the destination path is not present. That is a tested requirement of all Hadoop-compatible filesystems and object stores.
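
      The two rewrites can be sketched as follows. This is a minimal illustration, not the actual Spark patch: CountingFs is a hypothetical in-memory stand-in for a Hadoop FileSystem that only counts simulated remote calls. Its getFileStatus() is simplified to return just the is-directory flag, and the helper method names are invented for the example.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

class StatusCheckSketch {

    /** Hypothetical stand-in for a Hadoop FileSystem; counts simulated remote calls. */
    static class CountingFs {
        private final Map<String, Boolean> entries = new HashMap<>(); // path -> isDirectory
        int remoteCalls = 0;

        void put(String path, boolean isDirectory) {
            entries.put(path, isDirectory);
        }

        /** One remote call. Simplified: returns only the is-directory flag; throws if absent. */
        boolean getFileStatus(String path) throws FileNotFoundException {
            remoteCalls++;
            Boolean isDirectory = entries.get(path);
            if (isDirectory == null) {
                throw new FileNotFoundException(path);
            }
            return isDirectory;
        }

        /** Wrapper around getFileStatus() that turns FileNotFoundException into false. */
        boolean exists(String path) {
            try {
                getFileStatus(path);
                return true;
            } catch (FileNotFoundException e) {
                return false;
            }
        }

        /** One remote call. Returns false, without failing, when the path is absent. */
        boolean delete(String path) {
            remoteCalls++;
            return entries.remove(path) != null;
        }
    }

    /** Pattern 1, before: exists() followed by getFileStatus() costs two calls. */
    static boolean isDirectoryBefore(CountingFs fs, String path) {
        if (!fs.exists(path)) {
            return false;
        }
        try {
            return fs.getFileStatus(path);
        } catch (FileNotFoundException e) {
            return false; // the path may vanish between the two calls
        }
    }

    /** Pattern 1, after: a single getFileStatus(), with the catch moved to the caller. */
    static boolean isDirectoryAfter(CountingFs fs, String path) {
        try {
            return fs.getFileStatus(path);
        } catch (FileNotFoundException e) {
            return false;
        }
    }

    /** Pattern 2, before: exists() followed by delete() costs two calls when the path exists. */
    static void deleteBefore(CountingFs fs, String path) {
        if (fs.exists(path)) {
            fs.delete(path);
        }
    }

    /** Pattern 2, after: delete() alone, relying on it being a no-op for a missing path. */
    static void deleteAfter(CountingFs fs, String path) {
        fs.delete(path);
    }

    public static void main(String[] args) {
        CountingFs fs = new CountingFs();
        fs.put("/data", true);

        isDirectoryBefore(fs, "/data");
        System.out.println("exists + getFileStatus: " + fs.remoteCalls + " remote calls");

        fs.remoteCalls = 0;
        isDirectoryAfter(fs, "/data");
        System.out.println("getFileStatus only:     " + fs.remoteCalls + " remote calls");
    }
}
```

      In this sketch the merged form halves the number of status calls whenever the path exists, and it also removes the window in which the path can disappear between the check and the use.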

            People

              Assignee: Steve Loughran (stevel@apache.org)
              Reporter: Steve Loughran (stevel@apache.org)
              Votes: 0
              Watchers: 3
