Details
- Type: Bug
- Status: Open
- Priority: Major
- Resolution: Unresolved
- Affects Version/s: 3.5.0
- Fix Version/s: None
- Environment: Pyspark version 3.5.0
Description
I am using pip-installed PySpark 3.5.0 inside an IPython shell. The task is straightforward: take this CSV file of AAPL stock prices and compute the minimum and maximum volume-weighted average price (VWAP) over the entire file.
My code is here. I have also performed the same computation in DuckDB, because I noticed that the results of the Spark code are wrong: the exact same SQL yields different results in DuckDB and in Spark, and Spark's results are wrong.
I have never seen this behavior in a Spark release before. I'm very confused by it, and curious if anyone else can replicate this behavior.
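For readers without the attached files, here is a hypothetical plain-Python sketch of the kind of computation described: a rolling VWAP (sum of price × volume divided by sum of volume) over a window, followed by min/max across the file. The column names, the synthetic rows, and the window length of 2 are all assumptions for illustration; this is not the reporter's actual code or data.

```python
# Hypothetical stand-in for the attached Spark/DuckDB query.
# VWAP = sum(price * volume) / sum(volume), computed over a sliding
# window, then reduced to min and max across the whole series.

rows = [  # tiny synthetic stand-in for the AAPL CSV (assumed columns)
    {"close": 150.0, "volume": 1000},
    {"close": 152.0, "volume": 2000},
    {"close": 151.0, "volume": 1500},
    {"close": 149.5, "volume": 500},
]

def vwap(window):
    """Volume-weighted average price of one window of rows."""
    numerator = sum(r["close"] * r["volume"] for r in window)
    denominator = sum(r["volume"] for r in window)
    return numerator / denominator

WINDOW = 2  # assumed rolling-window length
vwaps = [
    vwap(rows[i - WINDOW + 1 : i + 1])
    for i in range(WINDOW - 1, len(rows))
]

print("min VWAP:", min(vwaps))
print("max VWAP:", max(vwaps))
```

Whatever the exact window and columns, both engines should agree on this arithmetic for identical SQL, which is why a discrepancy between Spark 3.5.0 and DuckDB points at a bug rather than a query error.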