Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

      Write dates/timestamps to a Parquet file in Spark 2.4:

      $ export TZ="UTC"
      $ ~/spark-2.4/bin/spark-shell
      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 2.4.5
            /_/
      
      Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
      Type in expressions to have them evaluated.
      Type :help for more information.
      
      scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
      
      scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts"))
      df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp]
      
      scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
      
      scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros")
      scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
      +----------+--------------------------+
      |d         |ts                        |
      +----------+--------------------------+
      |1001-01-01|1001-01-01 01:02:03.123456|
      +----------+--------------------------+
      

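      Spark 2.4 saves dates/timestamps in the hybrid Julian + Gregorian calendar, so values before the 1582-10-15 cutover are written as Julian calendar dates. The parquet-mr tool interprets the stored values in the Proleptic Gregorian calendar and prints 1001-01-07 and 1001-01-07T01:02:03.123456+0000: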

      $ java -jar /Users/maxim/proj/parquet-mr/parquet-tools/target/parquet-tools-1.12.0-SNAPSHOT.jar dump -m ./2_4_5_micros/part-00000-fe310bfa-0f61-44af-85ee-489721042c14-c000.snappy.parquet
      INT32 d
      --------------------------------------------------------------------------------
      *** row group 1 of 1, values 1 to 1 ***
      value 1: R:0 D:1 V:1001-01-07
      
      INT64 ts
      --------------------------------------------------------------------------------
      *** row group 1 of 1, values 1 to 1 ***
      value 1: R:0 D:1 V:1001-01-07T01:02:03.123456+0000
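
      The 6-day shift can be reproduced outside Spark. The sketch below is only an illustration and not code from Spark or parquet-mr: it assumes java.util.GregorianCalendar (with its default 1582-10-15 cutover) as a stand-in for the hybrid calendar and java.time.LocalDate for the Proleptic Gregorian calendar. The object and method names are hypothetical.

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}
import java.util.concurrent.TimeUnit

object CalendarShift {
  // Days since 1970-01-01 for a local date interpreted in the hybrid
  // Julian + Gregorian calendar (Julian before 1582-10-15).
  def hybridEpochDay(year: Int, month: Int, day: Int): Long = {
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.clear()                      // reset all fields, including time of day
    cal.set(year, month - 1, day)    // Calendar months are 0-based
    Math.floorDiv(cal.getTimeInMillis, TimeUnit.DAYS.toMillis(1))
  }

  def main(args: Array[String]): Unit = {
    // The day count a hybrid-calendar writer would store for 1001-01-01 ...
    val stored = hybridEpochDay(1001, 1, 1)
    // ... and the date a Proleptic Gregorian reader derives from it.
    val readBack = LocalDate.ofEpochDay(stored)
    println(s"read back as: $readBack") // prints "read back as: 1001-01-07"
  }
}
```

      Dates on or after the cutover produce the same day count in both calendars, so only old values are affected.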
      

      Spark 3.0.0-preview2 (and 3.1.0-SNAPSHOT) reads this file and prints the same values as parquet-mr, which differ from the values written by Spark 2.4:

      Welcome to
            ____              __
           / __/__  ___ _____/ /__
          _\ \/ _ \/ _ `/ __/  '_/
         /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
            /_/
      
      Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_231)
      scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_micros").show(false)
      +----------+--------------------------+
      |d         |ts                        |
      +----------+--------------------------+
      |1001-01-07|1001-01-07 01:02:03.123456|
      +----------+--------------------------+
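
      One way to make a Proleptic Gregorian reader return the original local date is to rebase the stored day count. The following is a minimal sketch under stated assumptions, not the change that resolved this issue: it assumes the file was written with the hybrid calendar, handles AD dates only, and uses hypothetical names (RebaseSketch, rebaseHybridToProlepticDays).

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}
import java.util.concurrent.TimeUnit

object RebaseSketch {
  private val utc = TimeZone.getTimeZone("UTC")

  // Takes a day count written under the hybrid Julian + Gregorian calendar and
  // returns the day count that denotes the same year-month-day in the
  // Proleptic Gregorian calendar. Assumes AD dates only.
  // Note: a Julian-only date such as 1000-02-29 has no Proleptic Gregorian
  // counterpart, so LocalDate.of throws for it.
  def rebaseHybridToProlepticDays(days: Int): Int = {
    val cal = new GregorianCalendar(utc)
    cal.setTimeInMillis(TimeUnit.DAYS.toMillis(days.toLong))
    val local = LocalDate.of(
      cal.get(Calendar.YEAR),
      cal.get(Calendar.MONTH) + 1,   // Calendar months are 0-based
      cal.get(Calendar.DAY_OF_MONTH))
    local.toEpochDay.toInt
  }

  def main(args: Array[String]): Unit = {
    // The stored value is the one a Proleptic Gregorian reader shows as 1001-01-07.
    val stored = LocalDate.of(1001, 1, 7).toEpochDay.toInt
    val rebased = rebaseHybridToProlepticDays(stored)
    println(LocalDate.ofEpochDay(rebased.toLong)) // prints 1001-01-01
  }
}
```

      Values on or after 1582-10-15 pass through unchanged, so such a step only matters for files that contain old dates.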
      

            People

              Assignee: maxgekk Max Gekk
              Reporter: maxgekk Max Gekk