Details
-
Wish
-
Status: Resolved
-
Minor
-
Resolution: Invalid
-
2.4.4, 3.0.0
-
None
-
None
Description
It is very useful to have a option for omiting the columns used in partitioning from the output while writing to a file data source like csv, avro, parquet, orc or excel.
Consider the following code:
Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);
myDF.select("value1", "value2", "year","month","day")
.write().format("csv")
.option("header", "true")
.partionBy("year","month","day")
.save("hdfs://user/spark/warehouse/csv_output_dir");
This will output many files in separated folders in a structure like:
csv_output_dir/_SUCCESS
csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
...
And the output will be something like:
┌──────┬──────┬──────┬───────┬─────┐
│ val1 │ val2 │ year │ month │ day │
├──────┼──────┼──────┼───────┼─────┤
│ 3673 │ 2345 │ 2019 │ 7 │ 10 │
│ 2345 │ 3423 │ 2019 │ 7 │ 10 │
│ 8765 │ 2423 │ 2019 │ 7 │ 10 │
└──────┴──────┴──────┴───────┴─────┘
When using partitioning in HIVE, the output from same source data will be something like:
┌──────┬──────┐
│ val1 │ val2 │
├──────┼──────┤
│ 3673 │ 2345 │
│ 2345 │ 3423 │
│ 8765 │ 2423 │
└──────┴──────┘
In this case the columns of the partitioning are not present in the CSV files. However output files follows the same folder/path structure as existing today.
Please considere adding a opt-in config for DataFrameWriter for leaving out the partitioning columns as in the second example.
The code could be something like:
Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);
myDF.select("value1", "value2", "year","month","day")
.write().format("csv")
.option("header", "true")
.option("partition.omit.cols", "true")
.partionBy("year","month","day")
.save("hdfs://user/spark/warehouse/csv_output_dir");
Thanks.