Details
- Type: Bug
- Status: In Progress
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.0.0
- Fix Version/s: None
- Component/s: None
Description
Reopening SPARK-25244 as it is unresolved as of versions 2.4.6 and 3.0.0.
The setting spark.sql.session.timeZone is respected by PySpark when converting to and from Pandas, as described here. However, when timestamps are converted directly to Python's datetime objects, it is ignored and the system's timezone is used instead.
This can be checked with the following code snippet:
import pyspark.sql

spark = (
    pyspark.sql.SparkSession.builder
    .master('local[1]')
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)
df = spark.createDataFrame([("2018-06-01 01:00:00",)], ["ts"])
df = df.withColumn("ts", df["ts"].astype("timestamp"))
print(df.toPandas().iloc[0, 0])
print(df.collect()[0][0])
For me this prints (the exact result depends on your system's timezone; mine is Europe/Berlin):

2018-06-01 01:00:00
2018-06-01 03:00:00
Hence, the method toPandas respects the timezone setting (UTC), but the method collect ignores it and converts the timestamp to my system's timezone.
The cause of this behaviour is that the methods toInternal and fromInternal of PySpark's TimestampType class do not take the setting spark.sql.session.timeZone into account and use the system timezone instead.
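The discrepancy can be reproduced with plain Python: datetime.fromtimestamp without an explicit tzinfo converts using the system timezone, which is essentially what fromInternal does. The sketch below illustrates the effect; the epoch value and the fixed +02:00 offset (standing in for Europe/Berlin under CEST) are illustrative assumptions, not taken from the Spark source:

```python
from datetime import datetime, timezone, timedelta

# Spark's internal timestamp representation: microseconds since the Unix epoch.
internal_us = 1527814800000000  # 2018-06-01 01:00:00 UTC

# Without an explicit tzinfo, the conversion depends on the system timezone
# (this is the problematic, fromInternal-like behaviour):
local_dt = datetime.fromtimestamp(internal_us / 1_000_000)

# Converting with an explicit zone shows what respecting
# spark.sql.session.timeZone=UTC would look like:
utc_dt = datetime.fromtimestamp(internal_us / 1_000_000, tz=timezone.utc)

# A fixed +02:00 offset approximating Europe/Berlin in summer (assumption):
berlin = timezone(timedelta(hours=2))
berlin_dt = datetime.fromtimestamp(internal_us / 1_000_000, tz=berlin)

print(utc_dt)     # 2018-06-01 01:00:00+00:00
print(berlin_dt)  # 2018-06-01 03:00:00+02:00
```

On a machine in the Europe/Berlin zone, local_dt matches berlin_dt's wall-clock time, which is exactly the shift seen in the collect output above.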
Attachments
Issue Links
- is a clone of: SPARK-25244 [Python] Setting `spark.sql.session.timeZone` only partially respected (Resolved)
- links to