Spark / SPARK-28505

Add data source option for omitting partitioned columns when saving to file


    Details

    • Type: Wish
    • Status: Resolved
    • Priority: Minor
    • Resolution: Invalid
    • Affects Version/s: 2.4.4, 3.0.0
    • Fix Version/s: None
    • Component/s: Input/Output, Spark Core
    • Labels:
      None

      Description

      It would be very useful to have an option for omitting the columns used for partitioning from the output when writing to a file data source such as CSV, Avro, Parquet, ORC, or Excel.

      Consider the following code:

      Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);
      myDF.select("value1", "value2", "year", "month", "day")
          .write().format("csv")
          .option("header", "true")
          .partitionBy("year", "month", "day")
          .save("hdfs://user/spark/warehouse/csv_output_dir");

      This will output many files in separate folders, in a structure like:

      csv_output_dir/_SUCCESS
      csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
      csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
      ...

      And the output will be something like:

      ┌────────┬────────┬──────┬───────┬─────┐
      │ value1 │ value2 │ year │ month │ day │
      ├────────┼────────┼──────┼───────┼─────┤
      │   3673 │   2345 │ 2019 │     7 │  10 │
      │   2345 │   3423 │ 2019 │     7 │  10 │
      │   8765 │   2423 │ 2019 │     7 │  10 │
      └────────┴────────┴──────┴───────┴─────┘

      When using partitioning in Hive, the output from the same source data will be something like:

      ┌────────┬────────┐
      │ value1 │ value2 │
      ├────────┼────────┤
      │   3673 │   2345 │
      │   2345 │   3423 │
      │   8765 │   2423 │
      └────────┴────────┘

      In this case the partitioning columns are not present in the CSV files; however, the output files follow the same folder/path structure as exists today.
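      The Hive layout can omit these columns from the file contents because each directory segment already encodes a column name and value. A minimal sketch (plain Java, independent of Spark) of recovering the partition values from such a path:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionPathParser {
    /** Extracts Hive-style "name=value" segments from a partition path. */
    public static Map<String, String> parse(String path) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                // Segment like "year=2019" -> key "year", value "2019".
                values.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> p =
            parse("csv_output_dir/year=2019/month=7/day=10/part-00000.csv");
        System.out.println(p); // prints {year=2019, month=7, day=10}
    }
}
```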

      Please consider adding an opt-in option to DataFrameWriter for leaving out the partitioning columns, as in the second example.

      The code could be something like:

      Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);
      myDF.select("value1", "value2", "year", "month", "day")
          .write().format("csv")
          .option("header", "true")
          .option("partition.omit.cols", "true")
          .partitionBy("year", "month", "day")
          .save("hdfs://user/spark/warehouse/csv_output_dir");
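      Such an option would lose no information, since a reader could rejoin each file's rows with the values encoded in its path. A hypothetical sketch in plain Java (the CSV row is passed in pre-split; real CSV parsing with quoting is out of scope here):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowReconstructor {
    /** Appends the partition values found in the file path to a data row. */
    public static List<String> rebuild(String path, List<String> fileRow) {
        List<String> full = new ArrayList<>(fileRow);
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                // Take the value from a "name=value" path segment.
                full.add(segment.substring(eq + 1));
            }
        }
        return full;
    }

    public static void main(String[] args) {
        List<String> row = rebuild(
            "csv_output_dir/year=2019/month=7/day=10/part-00000.csv",
            Arrays.asList("3673", "2345"));
        System.out.println(row); // prints [3673, 2345, 2019, 7, 10]
    }
}
```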

      Thanks.

        Attachments

          Activity

            People

            • Assignee: Unassigned
            • Reporter: Juarez Rudsatz (juarezr)
            • Votes: 0
            • Watchers: 2
