Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Incomplete
Affects Version: 2.2.1
Fix Version: None
Environment: Bare Spark 2.2.1 installation, on RHEL 6.
Description
The locale for org.apache.spark.sql.catalyst.util.DateTimeUtils, which is used internally by the to_date SQL function, is hard-coded to Locale.US.
This causes parsing failures on datasets whose dates use month names in a different language (Italian in this case).
spark.read.format("csv")
  .option("sep", ";")
  .csv(logFile)
  .toDF("DATA", .....)
  .withColumn("DATA2", to_date(col("DATA"), "yyyy MMM"))
  .show(10)
Results from example dataset:
| DATA     | DATA2      |
| 2018 giu | null       |
| 2018 mag | null       |
| 2018 apr | 2018-04-01 |
| 2018 mar | 2018-03-01 |
| 2018 feb | 2018-02-01 |
| 2018 gen | null       |
| 2017 dic | null       |
| 2017 nov | 2017-11-01 |
| 2017 ott | null       |
| 2017 set | null       |
Expected results: all values converted to dates. (Note that the abbreviations which do parse, such as "apr" and "mar", are exactly those that coincide with the English ones.)
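The root cause can be reproduced outside Spark with the JDK's own formatter: the same "yyyy MMM" pattern accepts "giu" only when the formatter is built with an Italian locale, not with the Locale.US that DateTimeUtils hard-codes. A minimal sketch (the class and method names are illustrative, not part of Spark):

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

public class LocaleParseDemo {
    // Try to parse a "yyyy MMM" string under the given locale;
    // return null when the month abbreviation is not recognized,
    // mirroring the null values to_date produces in the report above.
    public static YearMonth tryParse(String s, Locale locale) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyy MMM", locale);
        try {
            return YearMonth.parse(s, fmt);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("2018 giu", Locale.US));      // null
        System.out.println(tryParse("2018 giu", Locale.ITALIAN)); // 2018-06
    }
}
```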
TEMPORARY WORKAROUND:
In object org.apache.spark.sql.catalyst.util.DateTimeUtils, replace all instances of Locale.US with Locale.<your locale>, then rebuild Spark.
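A user-side alternative that avoids patching and rebuilding Spark is to normalize the localized month abbreviations to their US equivalents before calling to_date (for example inside a UDF or a preprocessing step). A minimal sketch; MonthNormalizer and its mapping are illustrative, not part of any Spark API:

```java
import java.util.HashMap;
import java.util.Map;

public class MonthNormalizer {
    // Italian abbreviated month names mapped to the US abbreviations
    // that the hard-coded Locale.US in DateTimeUtils expects.
    private static final Map<String, String> IT_TO_US = new HashMap<>();
    static {
        String[] it = {"gen","feb","mar","apr","mag","giu",
                       "lug","ago","set","ott","nov","dic"};
        String[] us = {"Jan","Feb","Mar","Apr","May","Jun",
                       "Jul","Aug","Sep","Oct","Nov","Dec"};
        for (int i = 0; i < it.length; i++) IT_TO_US.put(it[i], us[i]);
    }

    // Rewrite e.g. "2018 giu" into "2018 Jun" so that
    // to_date(..., "yyyy MMM") succeeds under the default US locale.
    public static String normalize(String value) {
        for (Map.Entry<String, String> e : IT_TO_US.entrySet()) {
            if (value.endsWith(e.getKey())) {
                return value.substring(0, value.length() - e.getKey().length())
                       + e.getValue();
            }
        }
        return value; // unchanged when no Italian abbreviation is found
    }
}
```

Registered as a UDF, this lets the original pipeline work unmodified on an unpatched cluster.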
ADDITIONAL NOTES:
I can make a pull request available on GitHub.