  Spark / SPARK-8014

DataFrame.write.mode("error").save(...) should not scan the output folder


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.0
    • Component/s: SQL
    • Labels: None

    Description

      When saving a DataFrame with ErrorIfExists as the save mode, we shouldn't do metadata discovery against the destination folder: if the folder already exists, the write should fail immediately instead of scanning its contents first. The same applies to SaveMode.Overwrite and SaveMode.Ignore, since neither needs the schema of any existing data.
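
      A minimal sketch of the intended fail-fast behavior (the resolveForWrite helper below is illustrative, not the actual DataFrameWriter/ResolvedDataSource code):

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.Path

      // Illustrative sketch only: consult the save mode *before* constructing
      // the relation, so no schema/metadata discovery runs against the folder.
      def resolveForWrite(mode: String, dest: String): Unit = {
        val path = new Path(dest)
        val fs = path.getFileSystem(new Configuration())
        val exists = fs.exists(path)
        mode match {
          case "error" if exists =>
            sys.error(s"path $dest already exists.") // fail fast, no footer reads
          case "ignore" if exists =>
            () // nothing to write, nothing to scan
          case _ =>
            // "overwrite", "append", or a fresh path: proceed with the write.
            // Only "append" genuinely needs the existing data's schema.
        }
      }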

      To reproduce this issue, create a directory /tmp/foo containing a single empty file bar (a programmatic sketch of this setup follows the shell snippet below), then execute the following code in the Spark shell:

      import sqlContext._
      import sqlContext.implicits._

      // Expected: fail fast because file:/tmp/foo already exists.
      // Actual: the folder is scanned first and the bogus footer blows up.
      Seq(1 -> "a").toDF("i", "s").write.format("parquet").mode("error").save("file:///tmp/foo")
      
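      For reference, the same layout can be prepared from within the shell (plain java.io; this snippet is a sketch added for illustration, not part of the original report):

      import java.io.File

      // Create /tmp/foo containing a single zero-byte file `bar`,
      // which is not a valid Parquet file.
      val dir = new File("/tmp/foo")
      dir.mkdirs()
      new File(dir, "bar").createNewFile()
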

      From the exception stack trace we can see that the metadata discovery code path is executed:

      java.io.IOException: Could not read footer: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small)
              at parquet.hadoop.ParquetFileReader.readAllFootersInParallel(ParquetFileReader.java:238)
              at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
              at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:154)
              at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:152)
              at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:193)
              at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:502)
              at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:501)
              at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:331)
              at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:144)
              at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:135)
              ...
      Caused by: java.lang.RuntimeException: file:/tmp/foo/bar is not a Parquet file (too small)
              at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:408)
              at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:228)
              at parquet.hadoop.ParquetFileReader$2.call(ParquetFileReader.java:224)
              at java.util.concurrent.FutureTask.run(FutureTask.java:266)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
              at java.lang.Thread.run(Thread.java:745)
      
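      The frames above show ResolvedDataSource.apply constructing the relation and forcing its lazily computed schema, which in turn reads every footer under the destination. A condensed illustration of that pattern (simplified types and names, not the actual Spark source):

      // Simplified illustration, not the actual Spark classes: the schema is a
      // lazy val backed by a full directory scan, so merely resolving the
      // relation during save() forces footer reads under the destination path.
      abstract class LazySchemaRelation(path: String) {
        protected def readAllFooters(dir: String): Seq[String] // expensive scan

        lazy val dataSchema: Seq[String] = readAllFooters(path)
        def schema: Seq[String] = dataSchema // the write path forces this today
      }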


          People

            Assignee: Cheng Lian
            Reporter: Jianshi Huang
            Votes: 0
            Watchers: 3
