  Spark / SPARK-8014

DataFrame.write.mode("error").save(...) should not scan the output folder


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.0
    • Component/s: SQL
    • Labels: None

    Description

      When saving a DataFrame with ErrorIfExists as the save mode, we shouldn't do metadata discovery against the destination folder: if the folder already exists, the write should fail immediately instead of scanning its contents first. The same applies to SaveMode.Overwrite and SaveMode.Ignore, since neither needs the schema of any existing data.
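
      A minimal sketch of the intended fail-fast behavior (the resolveForWrite helper below is illustrative, not the actual DataFrameWriter/ResolvedDataSource code):

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path

      // Illustrative sketch only: consult the save mode *before* constructing
      // the relation, so no schema/metadata discovery runs against the folder.
      def resolveForWrite(mode: String, dest: String): Unit = {
        val path = new Path(dest)
        val fs = path.getFileSystem(new Configuration())
        val exists = fs.exists(path)
        mode match {
          case "error" if exists =>
            sys.error(s"path $dest already exists.") // fail fast, no footer reads
          case "ignore" if exists =>
            () // nothing to write, nothing to scan
          case _ =>
            // "overwrite", "append", or a fresh path: proceed with the write.
            // Only "append" genuinely needs the existing data's schema.
        }
      }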

      To reproduce this issue, create a directory /tmp/foo containing a single empty file bar (a programmatic sketch of this setup follows the shell snippet below), then execute the following code in the Spark shell:

      import sqlContext._
      import sqlContext.implicits._

      // Expected: fail fast because file:/tmp/foo already exists.
      // Actual: the folder is scanned first and the bogus footer blows up.
      Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo")
      
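      For reference, the same layout can be prepared from within the shell (plain java.io; this snippet is a sketch added for illustration, not part of the original report):

      import java.io.File

      // Create /tmp/foo containing a single zero-byte file `bar`,
      // which is not a valid Parquet file.
      val dir = new File("/tmp/foo")
      dir.mkdirs()
      new File(dir, "bar").createNewFile()
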

      From the exception stack trace we can see that the metadata discovery code path is executed:

      java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small)
              at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238)
              at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
              at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
              at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
              at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193)
              at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502)
              at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501)
              at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331)
              at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
              at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
              ...
      Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small)
              at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:408)
              at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:228)
              at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:224)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      
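      The frames above show ResolvedDataSource.apply constructing the relation and forcing its lazily computed schema, which in turn reads every footer under the destination. A condensed illustration of that pattern (simplified types and names, not the actual Spark source):

      // Simplified illustration, not the actual Spark classes: the schema is a
      // lazy val backed by a full directory scan, so merely resolving the
      // relation during save() forces footer reads under the destination path.
      abstract class LazySchemaRelation(path: String) {
        protected def readAllFooters(dir: String): Seq[String] // expensive scan

        lazy val dataSchema: Seq[String] = readAllFooters(path)
        def schema: Seq[String] = dataSchema // the write path forces this today
      }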


          People

            Assignee: Cheng Lian
            Reporter: Jianshi Huang
            Votes: 0
            Watchers: 3
