Spark / SPARK-25102

Write Spark version to ORC/Parquet file metadata


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 2.4.6, 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      Currently, Spark writes its version number into Hive table properties under the key `spark.sql.create.version`:

      parameters:{
        spark.sql.sources.schema.part.0={
          "type":"struct",
          "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
        },
        transient_lastDdlTime=1541142761, 
        spark.sql.sources.schema.numParts=1,
        spark.sql.create.version=2.4.0
      }
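The schema above is stored as one or more `spark.sql.sources.schema.part.N` fragments because Hive caps the length of a property value; Spark concatenates the parts and parses the resulting JSON. A minimal Python sketch of that reassembly (property values copied from the listing above; `reassemble_schema` is an illustrative helper, not Spark's actual code):

```python
import json

# Hive table properties as shown in the listing above.
props = {
    "spark.sql.sources.schema.numParts": "1",
    "spark.sql.sources.schema.part.0": (
        '{"type":"struct","fields":'
        '[{"name":"a","type":"integer","nullable":true,"metadata":{}}]}'
    ),
    "spark.sql.create.version": "2.4.0",
}

def reassemble_schema(props: dict) -> dict:
    """Concatenate the schema.part.N fragments in order and parse the JSON.

    Spark splits a long schema string across numbered parts because Hive
    limits how long a single table-property value may be.
    """
    num_parts = int(props["spark.sql.sources.schema.numParts"])
    raw = "".join(
        props[f"spark.sql.sources.schema.part.{i}"] for i in range(num_parts)
    )
    return json.loads(raw)

schema = reassemble_schema(props)
# schema["fields"][0] describes the single column "a" of type integer.
```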
      

      This issue aims to write the Spark version to ORC/Parquet file metadata under the key `org.apache.spark.sql.create.version`. This is different from the Hive table property key `spark.sql.create.version`; it seems we cannot change that key, for backward compatibility (even in Apache Spark 3.0).

      ORC

      User Metadata:
        org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
      

      PARQUET

      file:        file:/tmp/p/part-00007-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
      creator:     parquet-mr version 1.10.0 (build 031a6654009e3b82020012a18434c582bd74c73a)
      extra:       org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
      extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
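For reference, the `extra:` key-value pairs above live in the Parquet file footer: a Parquet file ends with its Thrift-encoded footer, followed by a 4-byte little-endian footer length and the magic bytes `PAR1`. A minimal pure-Python sketch of locating that footer (it uses a synthetic byte buffer with a placeholder payload rather than a real Thrift-encoded footer):

```python
import struct

MAGIC = b"PAR1"

def footer_bytes(data: bytes) -> bytes:
    """Return the raw (Thrift-encoded) footer of a Parquet byte buffer.

    Layout at the end of the file: [footer][4-byte LE footer length][b"PAR1"].
    """
    if not data.endswith(MAGIC):
        raise ValueError("not a Parquet file: missing trailing magic")
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    return data[-8 - footer_len:-8]

# Synthetic buffer standing in for a real file; the "footer" here is a
# placeholder payload, not actual Thrift-encoded FileMetaData.
fake_footer = b"\x15\x00metadata-placeholder"
buf = (
    MAGIC + b"...column data..."
    + fake_footer
    + struct.pack("<I", len(fake_footer))
    + MAGIC
)
recovered = footer_bytes(buf)
```

In a real file the recovered bytes decode (via Thrift) to a `FileMetaData` struct whose key-value metadata holds entries like `org.apache.spark.sql.create.version`.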
      


            People

              Assignee: Dongjoon Hyun
              Reporter: Zoltan Ivanfi
