Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-28505

Add data source option for omitting partitioned columns when saving to file

    XMLWordPrintableJSON

Details

    • Wish
    • Status: Resolved
    • Minor
    • Resolution: Invalid
    • 2.4.4, 3.0.0
    • None
    • Input/Output, Spark Core
    • None

    Description

      It is very useful to have a option for omiting the columns used in partitioning from the output while writing to a file data source like csv, avro, parquet, orc or excel.

      Consider the following code:

      Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);
      myDF.select("value1", "value2", "year","month","day")
      .write().format("csv")
      .option("header", "true")
      .partionBy("year","month","day")
      .save("hdfs://user/spark/warehouse/csv_output_dir");

      This will output many files in separated folders in a structure like:

      csv_output_dir/_SUCCESS
      csv_output_dir/year=2019/month=7/day=10/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
      csv_output_dir/year=2019/month=7/day=11/part-00000-ac09671e-5ee3-4479-ae83-5301aa7f424b.c000.csv
      ...

      And the output will be something like:

      ┌──────┬──────┬──────┬───────┬─────┐
      │ val1 │ val2 │ year │ month │ day │
      ├──────┼──────┼──────┼───────┼─────┤
      │ 3673 │ 2345 │ 2019 │     7 │ 10  │
      │ 2345 │ 3423 │ 2019 │     7 │ 10  │
      │ 8765 │ 2423 │ 2019 │     7 │ 10  │
      └──────┴──────┴──────┴───────┴─────┘

      When using partitioning in HIVE, the output from same source data will be something like:

      ┌──────┬──────┐
      │ val1 │ val2 │
      ├──────┼──────┤
      │ 3673 │ 2345 │
      │ 2345 │ 3423 │
      │ 8765 │ 2423 │
      └──────┴──────┘

      In this case the columns of the partitioning are not present in the CSV files. However output files follows the same folder/path structure as existing today.

      Please considere adding a opt-in config for DataFrameWriter for leaving out the partitioning columns as in the second example.

      The code could be something like:

      Dataset<Row> myDF = spark.createDataFrame(myRDD, MyClass.class);
      myDF.select("value1", "value2", "year","month","day")
      .write().format("csv")
      .option("header", "true")
      .option("partition.omit.cols", "true")
      .partionBy("year","month","day")
      .save("hdfs://user/spark/warehouse/csv_output_dir");

      Thanks.
       

      Attachments

        Activity

          People

            Unassigned Unassigned
            juarezr Juarez Rudsatz
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: