Spark / SPARK-45440

Incorrect summary counts from a CSV file


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Environment: PySpark version 3.5.0

    Description

      I am using pip-installed PySpark 3.5.0 inside an IPython shell. The task is straightforward: take this CSV file of AAPL stock prices and compute the minimum and maximum volume weighted average price (VWAP) for the entire file.

      My code is here. I've also performed the same computation in DuckDB, because I noticed that the results of the Spark code looked wrong.

      The exact same SQL yields different results in DuckDB and in Spark, and Spark's results are wrong.

      I have never seen this behavior in a Spark release before. I'm very confused by it, and curious whether anyone else can replicate it.
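      For reference, the reporter's actual code is attached to the issue and not reproduced here. As a plain-Python sketch of the quantity in question (so the expected answer can be checked by hand against any engine): VWAP over a set of trades is sum(price × volume) / sum(volume). The column names and sample values below are illustrative, not taken from the attached CSV.

```python
# Sketch of the volume weighted average price (VWAP) computation the
# issue describes. The reporter's PySpark/DuckDB code is attached to
# the ticket; this is only a hand-checkable reference implementation.

def vwap(rows):
    """VWAP = sum(price * volume) / sum(volume) over (price, volume) pairs."""
    total_volume = sum(volume for _, volume in rows)
    return sum(price * volume for price, volume in rows) / total_volume

# Hypothetical AAPL rows as (close_price, volume) pairs.
trades = [(150.0, 1_000), (152.0, 3_000), (151.0, 2_000)]
print(vwap(trades))  # (150*1000 + 152*3000 + 151*2000) / 6000
```

      Any SQL engine computing the same expression, e.g. `SUM(price * volume) / SUM(volume)` per group and then `MIN`/`MAX` across groups, should agree with this reference value for the same input rows.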

Attachments

Activity

People

    Assignee: Unassigned
    Reporter: Evan Volgas (evanv)
    Votes: 0
    Watchers: 2
