Spark / SPARK-8037

Ignore files whose names start with "." while enumerating files in HadoopFsRelation

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: 1.4.0
    • Component/s: SQL
    • Labels: None

    Description

      Temporary files such as .DS_Store, generated by the Mac OS X Finder, can break partition discovery. A directory whose layout looks like the following

      > find parquet_partitioned
      parquet_partitioned
      parquet_partitioned/._common_metadata.crc
      parquet_partitioned/._metadata.crc
      parquet_partitioned/._SUCCESS.crc
      parquet_partitioned/_common_metadata
      parquet_partitioned/_metadata
      parquet_partitioned/_SUCCESS
      parquet_partitioned/year=2014/.DS_Store
      parquet_partitioned/year=2014/month=9
      parquet_partitioned/year=2014/month=9/.DS_Store
      parquet_partitioned/year=2014/month=9/day=1/.DS_Store
      parquet_partitioned/year=2014/month=9/day=1/.part-r-00008.gz.parquet.crc
      parquet_partitioned/year=2014/month=9/day=1/part-r-00008.gz.parquet
      parquet_partitioned/year=2015
      parquet_partitioned/year=2015/month=10
      parquet_partitioned/year=2015/month=10/day=25
      parquet_partitioned/year=2015/month=10/day=25/.part-r-00002.gz.parquet.crc
      parquet_partitioned/year=2015/month=10/day=25/.part-r-00004.gz.parquet.crc
      parquet_partitioned/year=2015/month=10/day=25/part-r-00002.gz.parquet
      parquet_partitioned/year=2015/month=10/day=25/part-r-00004.gz.parquet
      parquet_partitioned/year=2015/month=10/day=26
      parquet_partitioned/year=2015/month=10/day=26/.part-r-00005.gz.parquet.crc
      parquet_partitioned/year=2015/month=10/day=26/part-r-00005.gz.parquet
      parquet_partitioned/year=2015/month=9
      parquet_partitioned/year=2015/month=9/day=1
      parquet_partitioned/year=2015/month=9/day=1/.part-r-00007.gz.parquet.crc
      parquet_partitioned/year=2015/month=9/day=1/part-r-00007.gz.parquet
      

      causes an exception like this:

      scala> val df = sqlContext.read.parquet("parquet_partitioned")
      java.lang.AssertionError: assertion failed: Conflicting partition column names detected:
          ArrayBuffer(year, month)
      ArrayBuffer(year)
      ArrayBuffer(year, month, day)
          at scala.Predef$.assert(Predef.scala:179)
          at org.apache.spark.sql.sources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:189)
          at org.apache.spark.sql.sources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:87)
          at org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$discoverPartitions(interfaces.scala:492)
          at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:449)
          at org.apache.spark.sql.sources.HadoopFsRelation$$anonfun$partitionSpec$3.apply(interfaces.scala:448)
      

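      This happens because .DS_Store files are treated as data files. Every directory that contains one is therefore taken as a partition directory, so year=2014 yields the columns (year) and year=2014/month=9 yields (year, month), which conflict with the real (year, month, day) partitions.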
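
The effect of skipping dot-prefixed names can be illustrated with a minimal, self-contained sketch. It is plain Scala with no Spark or Hadoop dependency, and it is not the code used in the actual fix: the helpers `isHiddenName`, `visibleFiles`, and `partitionColumns` are hypothetical names introduced here, and the file list is a subset of the layout above.

```scala
// Hypothetical helper (not part of Spark): true for names such as
// ".DS_Store" or ".part-r-00008.gz.parquet.crc".
def isHiddenName(name: String): Boolean = name.startsWith(".")

// Keep only the paths whose last component is not hidden.
def visibleFiles(paths: Seq[String]): Seq[String] =
  paths.filterNot(p => isHiddenName(p.substring(p.lastIndexOf('/') + 1)))

// Simplified stand-in for partition discovery: for each file, collect the
// column names of the "key=value" directories above it, then deduplicate.
def partitionColumns(files: Seq[String]): Set[Seq[String]] =
  files.map { f =>
    f.split('/').dropRight(1).toSeq.collect {
      case segment if segment.contains("=") => segment.takeWhile(_ != '=')
    }
  }.toSet

// A subset of the directory listing from the description.
val files = Seq(
  "parquet_partitioned/year=2014/.DS_Store",
  "parquet_partitioned/year=2014/month=9/.DS_Store",
  "parquet_partitioned/year=2014/month=9/day=1/.DS_Store",
  "parquet_partitioned/year=2014/month=9/day=1/part-r-00008.gz.parquet",
  "parquet_partitioned/year=2015/month=10/day=25/part-r-00002.gz.parquet")

// Without filtering: (year), (year, month) and (year, month, day) all appear.
val conflicting = partitionColumns(files)

// With filtering: only (year, month, day) remains.
val consistent = partitionColumns(visibleFiles(files))
```

In this sketch `conflicting` holds three different column lists, mirroring the three `ArrayBuffer` entries in the assertion message, while `consistent` holds a single one.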

            People

              Assignee: Cheng Lian
              Reporter: Cheng Lian