SPARK-24969

SQL: to_date function can't parse date strings in different locales.


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.1
    • Fix Version/s: None
    • Component/s: SQL
    • Environment: Bare Spark 2.2.1 installation on RHEL 6

    Description

      The locale for org.apache.spark.sql.catalyst.util.DateTimeUtils, which is used internally by the to_date SQL function, is hard-coded to Locale.US.

      This makes it impossible to parse datasets whose dates are written in a different language (Italian, in this case).

      import org.apache.spark.sql.functions.{col, to_date}

      spark.read.format("csv")
        .option("sep", ";")
        .csv(logFile)
        .toDF("DATA", .....)  // remaining column names omitted
        .withColumn("DATA2", to_date(col("DATA"), "yyyy MMM"))
        .show(10)
      

      Results from example dataset:

      DATA       DATA2
      2018 giu   null
      2018 mag   null
      2018 apr   2018-04-01
      2018 mar   2018-03-01
      2018 feb   2018-02-01
      2018 gen   null
      2017 dic   null
      2017 nov   2017-11-01
      2017 ott   null
      2017 set   null

      Expected result: all values converted. Note that only the rows whose Italian month abbreviations happen to match the English ones (apr, mar, feb, nov) are parsed; every other month comes back null.
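
      To illustrate the root cause outside Spark (a minimal sketch, assuming the pattern is ultimately handed to a java.text.SimpleDateFormat built with the hard-coded Locale.US):

      import java.text.SimpleDateFormat
      import java.util.Locale
      import scala.util.Try

      // The US locale (what DateTimeUtils hard-codes) rejects the Italian
      // abbreviation "giu", while an Italian locale parses it fine.
      val us = new SimpleDateFormat("yyyy MMM", Locale.US)
      val it = new SimpleDateFormat("yyyy MMM", Locale.ITALIAN)

      println(Try(us.parse("2018 giu")))  // Failure(java.text.ParseException: ...)
      println(Try(it.parse("2018 giu")))  // Success(... Jun 01 ... 2018)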

      TEMPORARY WORKAROUND:

      In the org.apache.spark.sql.catalyst.util.DateTimeUtils object, replace all occurrences of Locale.US with Locale.<your locale> and rebuild Spark.
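
      A user-side alternative that avoids patching Spark (a sketch only: df stands for the DataFrame built in the example above, and this bypasses to_date entirely) is to parse the strings in a UDF with an explicitly Italian java.time formatter:

      import java.sql.Date
      import java.time.LocalDate
      import java.time.format.DateTimeFormatterBuilder
      import java.time.temporal.ChronoField
      import java.util.Locale

      import org.apache.spark.sql.functions.{col, udf}

      // Parse "yyyy MMM" strings such as "2018 giu" with the Italian locale,
      // defaulting the missing day-of-month to 1 (like to_date does).
      val parseItalianDate = udf { (s: String) =>
        if (s == null) null
        else {
          val fmt = new DateTimeFormatterBuilder()
            .parseCaseInsensitive()
            .appendPattern("yyyy MMM")
            .parseDefaulting(ChronoField.DAY_OF_MONTH, 1)
            .toFormatter(Locale.ITALIAN)
          Date.valueOf(LocalDate.parse(s.trim, fmt))
        }
      }

      df.withColumn("DATA2", parseItalianDate(col("DATA"))).show(10)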

      ADDITIONAL NOTES:

      I can open a pull request on GitHub.


    People

      Assignee: Unassigned
      Reporter: Valentino Pinna
