Spark / SPARK-45440

Incorrect summary counts from a CSV file


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.5.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Environment: PySpark version 3.5.0

    Description

      I am using pip-installed PySpark 3.5.0 inside an IPython shell. The task is straightforward: take this CSV file of AAPL stock prices and compute the minimum and maximum volume weighted average price (VWAP) for the entire file.

      My code is here. I've also performed the same computation in DuckDB, because I noticed that the results of the Spark code looked wrong.

      The exact same SQL yields different results in DuckDB and in Spark, and Spark's results are wrong.

      I have never seen this behavior in a Spark release before. I'm very confused by it, and curious whether anyone else can replicate it.
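      For reference, the reporter's actual code is attached to the issue and not reproduced here. As a plain-Python sketch of the quantity in question (so the expected answer can be checked by hand against any engine): VWAP over a set of trades is sum(price × volume) / sum(volume). The column names and sample values below are illustrative, not taken from the attached CSV.

```python
# Sketch of the volume weighted average price (VWAP) computation the
# issue describes. The reporter's PySpark/DuckDB code is attached to
# the ticket; this is only a hand-checkable reference implementation.

def vwap(rows):
    """VWAP = sum(price * volume) / sum(volume) over (price, volume) pairs."""
    total_volume = sum(volume for _, volume in rows)
    return sum(price * volume for price, volume in rows) / total_volume

# Hypothetical AAPL rows as (close_price, volume) pairs.
trades = [(150.0, 1_000), (152.0, 3_000), (151.0, 2_000)]
print(vwap(trades))  # (150*1000 + 152*3000 + 151*2000) / 6000
```

      Any SQL engine computing the same expression, e.g. `SUM(price * volume) / SUM(volume)` per group and then `MIN`/`MAX` across groups, should agree with this reference value for the same input rows.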

Attachments

Activity

People

    Assignee: Unassigned
    Reporter: Evan Volgas (evanv)
    Votes: 0
    Watchers: 2
