Details
- Type: Bug
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.0.0
- Fix Version/s: None
- Component/s: None
Description
Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0.
The setting spark.sql.session.timeZone is respected by PySpark when converting to and from Pandas, as described here. However, when timestamps are converted directly to Python's datetime objects, it is ignored and the system's timezone is used instead.
This can be checked with the following code snippet:
import pyspark.sql

spark = (
    pyspark.sql.SparkSession.builder
    .master('local[1]')
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)
df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))
print(df.toPandas().iloc[0, 0])
print(df.collect()[0][0])
For me this prints (the exact result depends on your system's timezone; mine is Europe/Berlin):

2018-06-01 01:00:00
2018-06-01 03:00:00
Hence, the method toPandas respects the timezone setting (UTC), but the method collect ignores it and converts the timestamp to my system's timezone.
The cause of this behaviour is that the methods toInternal and fromInternal of PySpark's TimestampType class do not take the setting spark.sql.session.timeZone into account and use the system timezone instead.
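The discrepancy can be reproduced with plain Python: datetime.fromtimestamp without an explicit tzinfo converts using the system timezone, which is essentially what fromInternal does. The sketch below illustrates the effect; the epoch value and the fixed +02:00 offset (standing in for Europe/Berlin under CEST) are illustrative assumptions, not taken from the Spark source:

```python
from datetime import datetime, timezone, timedelta

# Spark's internal timestamp representation: microseconds since the Unix epoch.
internal_us = 1527814800000000  # 2018-06-01 01:00:00 UTC

# Without an explicit tzinfo, the conversion depends on the system timezone
# (this is the problematic, fromInternal-like behaviour):
local_dt = datetime.fromtimestamp(internal_us / 1_000_000)

# Converting with an explicit zone shows what respecting
# spark.sql.session.timeZone=UTC would look like:
utc_dt = datetime.fromtimestamp(internal_us / 1_000_000, tz=timezone.utc)

# A fixed +02:00 offset approximating Europe/Berlin in summer (assumption):
berlin = timezone(timedelta(hours=2))
berlin_dt = datetime.fromtimestamp(internal_us / 1_000_000, tz=berlin)

print(utc_dt)     # 2018-06-01 01:00:00+00:00
print(berlin_dt)  # 2018-06-01 03:00:00+02:00
```

On a machine in the Europe/Berlin zone, local_dt matches berlin_dt's wall-clock time, which is exactly the shift seen in the collect output above.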
Attachments
Issue Links
- is a clone of: SPARK-25244 [Python] Setting `spark.sql.session.timeZone` only partially respected (Resolved)
- links to