Spark / SPARK-28505

Add data source option for omitting partitioned columns when saving to file


    Details

    • Type: Wish
    • Status: Resolved
    • Priority: Minor
    • Resolution: Invalid
    • Affects Version/s: 2.4.4, 3.0.0
    • Fix Version/s: None
    • Component/s: Input/Output, Spark Core
    • Labels:
      None

      Description

      It would be very useful to have an option for omitting the columns used for partitioning from the output when writing to a file data source such as CSV, Avro, Parquet, ORC, or Excel.

      Consider the following code:

      Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);
      myDF.select("value1", "value2", "year", "month", "day")
          .write().format("csv")
          .option("header", "true")
          .partitionBy("year", "month", "day")
          .save("hdfs://user/spark/warehouse/csv_output_dir");

      This will output many files in separate folders, in a structure like:

      csv_output_dir/_SUCCESS
      csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
      csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
      ...

      And the output will be something like:

      ┌────────┬────────┬──────┬───────┬─────┐
      │ value1 │ value2 │ year │ month │ day │
      ├────────┼────────┼──────┼───────┼─────┤
      │   3673 │   2345 │ 2019 │     7 │  10 │
      │   2345 │   3423 │ 2019 │     7 │  10 │
      │   8765 │   2423 │ 2019 │     7 │  10 │
      └────────┴────────┴──────┴───────┴─────┘

      When using partitioning in Hive, the output from the same source data will be something like:

      ┌────────┬────────┐
      │ value1 │ value2 │
      ├────────┼────────┤
      │   3673 │   2345 │
      │   2345 │   3423 │
      │   8765 │   2423 │
      └────────┴────────┘

      In this case the partitioning columns are not present in the CSV files; however, the output files follow the same folder/path structure as exists today.
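      The Hive layout can omit these columns from the file contents because each directory segment already encodes a column name and value. A minimal sketch (plain Java, independent of Spark) of recovering the partition values from such a path:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionPathParser {
    /** Extracts Hive-style "name=value" segments from a partition path. */
    public static Map<String, String> parse(String path) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                // Segment like "year=2019" -> key "year", value "2019".
                values.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> p =
            parse("csv_output_dir/year=2019/month=7/day=10/part-00000.csv");
        System.out.println(p); // prints {year=2019, month=7, day=10}
    }
}
```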

      Please consider adding an opt-in option to DataFrameWriter for leaving out the partitioning columns, as in the second example.

      The code could be something like:

      Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);
      myDF.select("value1", "value2", "year", "month", "day")
          .write().format("csv")
          .option("header", "true")
          .option("partition.omit.cols", "true")
          .partitionBy("year", "month", "day")
          .save("hdfs://user/spark/warehouse/csv_output_dir");
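      Such an option would lose no information, since a reader could rejoin each file's rows with the values encoded in its path. A hypothetical sketch in plain Java (the CSV row is passed in pre-split; real CSV parsing with quoting is out of scope here):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowReconstructor {
    /** Appends the partition values found in the file path to a data row. */
    public static List<String> rebuild(String path, List<String> fileRow) {
        List<String> full = new ArrayList<>(fileRow);
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                // Take the value from a "name=value" path segment.
                full.add(segment.substring(eq + 1));
            }
        }
        return full;
    }

    public static void main(String[] args) {
        List<String> row = rebuild(
            "csv_output_dir/year=2019/month=7/day=10/part-00000.csv",
            Arrays.asList("3673", "2345"));
        System.out.println(row); // prints [3673, 2345, 2019, 7, 10]
    }
}
```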

      Thanks.

        Attachments

          Activity

            People

            • Assignee: Unassigned
            • Reporter: Juarez Rudsatz (juarezr)
            • Votes: 0
            • Watchers: 2
