Spark / SPARK-14117

write.partitionBy retains partitioning column when outputting Parquet


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.6.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      When writing a DataFrame as Parquet using partitionBy on the writer to generate multiple output folders, the resulting Parquet files still contain the partitioning column.

      Here's a simple example:

      from pyspark.sql import Row

      # `sql` is a SQLContext (Spark 1.6)
      df = sql.createDataFrame([
        Row(a="folder 1 message 1", folder="folder1"),
        Row(a="folder 1 message 2", folder="folder1"),
        Row(a="folder 1 message 3", folder="folder1"),
        Row(a="folder 2 message 1", folder="folder2"),
        Row(a="folder 2 message 2", folder="folder2"),
        Row(a="folder 2 message 3", folder="folder2"),
      ])

      df.write.partitionBy('folder').parquet('output')
      

      produces the following output when the written files are read back:

      +------------------+-------+
      |                 a| folder|
      +------------------+-------+
      |folder 2 message 1|folder2|
      +------------------+-------+
      

      whereas

      df.write.partitionBy('folder').json('output')
      

      produces:

      {"a":"folder 2 message 1"}
      

      without the partitioning column.

      I'm assuming this is a bug, given the inconsistent behaviour between the two output formats.
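The expected behaviour here is Hive-style partitioned output: each partition value becomes a directory segment (`folder=folder1/`), so a writer can drop the column from the file contents and a reader can reconstruct it from the path. A minimal, Spark-free Python sketch of that layout (all names here are illustrative, not Spark internals):

```python
import json
import os
import tempfile

rows = [
    {"a": "folder 1 message 1", "folder": "folder1"},
    {"a": "folder 2 message 1", "folder": "folder2"},
]

out = tempfile.mkdtemp()
for row in rows:
    # Hive-style layout: the partition value lives in the directory name.
    part_dir = os.path.join(out, "folder=" + row["folder"])
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "part-00000.json"), "a") as f:
        # Drop the partition column from the file contents.
        f.write(json.dumps({k: v for k, v in row.items() if k != "folder"}) + "\n")

# Reading back: recover the partition value from the directory name.
for name in sorted(os.listdir(out)):
    key, _, value = name.partition("=")
    with open(os.path.join(out, name, "part-00000.json")) as f:
        for line in f:
            rec = json.loads(line)
            rec[key] = value  # reconstructed partition column
            print(rec)
```

With this layout the on-disk files never duplicate the partition column, which is what the JSON writer does above and what the reporter expected from the Parquet writer as well.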


      People

        Assignee: Unassigned
        Reporter: Franklyn Dsouza