Spark / SPARK-14117

write.partitionBy retains partitioning column when outputting Parquet


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.6.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      When writing a DataFrame as Parquet using partitionBy on the writer to generate multiple output folders, the resulting Parquet files still contain the partitioning column.

      Here's a simple example:

      from pyspark.sql import Row

      # `sql` is a SQLContext (Spark 1.6)
      df = sql.createDataFrame([
        Row(a="folder 1 message 1", folder="folder1"),
        Row(a="folder 1 message 2", folder="folder1"),
        Row(a="folder 1 message 3", folder="folder1"),
        Row(a="folder 2 message 1", folder="folder2"),
        Row(a="folder 2 message 2", folder="folder2"),
        Row(a="folder 2 message 3", folder="folder2"),
      ])

      df.write.partitionBy('folder').parquet('output')
      

      produces the following output when the written files are read back:

      +------------------+-------+
      |                 a| folder|
      +------------------+-------+
      |folder 2 message 1|folder2|
      +------------------+-------+
      

      whereas

      df.write.partitionBy('folder').json('output')
      

      produces:

      {"a":"folder 2 message 1"}
      

      without the partitioning column.

      I'm assuming this is a bug, given the inconsistent behaviour between the two output formats.
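The expected behaviour here is Hive-style partitioned output: each partition value becomes a directory segment (`folder=folder1/`), so a writer can drop the column from the file contents and a reader can reconstruct it from the path. A minimal, Spark-free Python sketch of that layout (all names here are illustrative, not Spark internals):

```python
import json
import os
import tempfile

rows = [
    {"a": "folder 1 message 1", "folder": "folder1"},
    {"a": "folder 2 message 1", "folder": "folder2"},
]

out = tempfile.mkdtemp()
for row in rows:
    # Hive-style layout: the partition value lives in the directory name.
    part_dir = os.path.join(out, "folder=" + row["folder"])
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "part-00000.json"), "a") as f:
        # Drop the partition column from the file contents.
        f.write(json.dumps({k: v for k, v in row.items() if k != "folder"}) + "\n")

# Reading back: recover the partition value from the directory name.
for name in sorted(os.listdir(out)):
    key, _, value = name.partition("=")
    with open(os.path.join(out, name, "part-00000.json")) as f:
        for line in f:
            rec = json.loads(line)
            rec[key] = value  # reconstructed partition column
            print(rec)
```

With this layout the on-disk files never duplicate the partition column, which is what the JSON writer does above and what the reporter expected from the Parquet writer as well.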


      People

        Assignee: Unassigned
        Reporter: Franklyn Dsouza