Description
When writing a DataFrame as parquet with partitionBy on the writer to generate multiple output folders, the resulting parquet files still contain the partitioning column as a data column.
Here's a simple example:
from pyspark.sql import Row

# `sql` is an existing SQLContext (or SparkSession)
df = sql.createDataFrame([
    Row(a="folder 1 message 1", folder="folder1"),
    Row(a="folder 1 message 2", folder="folder1"),
    Row(a="folder 1 message 3", folder="folder1"),
    Row(a="folder 2 message 1", folder="folder2"),
    Row(a="folder 2 message 2", folder="folder2"),
    Row(a="folder 2 message 3", folder="folder2"),
])
df.write.partitionBy('folder').parquet('output')
produces the following output :-
+------------------+-------+
|                 a| folder|
+------------------+-------+
|folder 2 message 1|folder2|
+------------------+-------+
whereas
df.write.partitionBy('folder').json('output')
produces :-
{"a":"folder 2 message 1"}
without the partitioning column.
I'm assuming this is a bug because of the different behaviour between the two.
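For reference, the layout I would expect (partition value carried only by the folder name, not repeated inside the files) can be sketched in plain Python without Spark. This is only an illustration of the expected behaviour, not how Spark implements it; write_partitioned_json and the part-00000.json file name are hypothetical and made up for this sketch.

```python
import json
import os
import tempfile
from collections import defaultdict


def write_partitioned_json(rows, partition_col, out_dir):
    """Hypothetical helper (not a Spark API): mimic a partitioned layout.

    Each distinct value of partition_col gets its own
    '<partition_col>=<value>' directory, and the column is dropped
    from the records written inside that directory.
    """
    groups = defaultdict(list)
    for row in rows:
        # Copy the row without the partitioning column.
        data = {k: v for k, v in row.items() if k != partition_col}
        groups[row[partition_col]].append(data)

    paths = []
    for value, records in groups.items():
        part_dir = os.path.join(out_dir, f"{partition_col}={value}")
        os.makedirs(part_dir, exist_ok=True)
        # Illustrative file name only; one JSON record per line.
        path = os.path.join(part_dir, "part-00000.json")
        with open(path, "w") as fh:
            for record in records:
                fh.write(json.dumps(record) + "\n")
        paths.append(path)
    return sorted(paths)


rows = [
    {"a": "folder 1 message 1", "folder": "folder1"},
    {"a": "folder 2 message 1", "folder": "folder2"},
]
out_dir = tempfile.mkdtemp()
paths = write_partitioned_json(rows, "folder", out_dir)

# Read the first record back from the folder2 partition.
with open(paths[1]) as fh:
    first_record = json.loads(fh.readline())
print(first_record)
```

In this sketch the record read back from the folder=folder2 directory has only the "a" field, which matches what the json writer does above and is what I would expect from the parquet writer as well.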