SPARK-47024

Sum of floats/doubles may be incorrect depending on partitioning

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 3.4.2, 3.5.0, 3.3.4
    • Fix Version/s: None
    • Component/s: SQL

    Description

      I found this problem using Hypothesis.

      Here's a reproduction that fails on master, 3.5.0, 3.4.2, and 3.3.4 (and probably all prior versions as well):

      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col, sum
      
      SUM_EXAMPLE = [
          (1.0,),
          (0.0,),
          (1.0,),
          (9007199254740992.0,),
      ]
      
      spark = (
          SparkSession.builder
          .config("spark.log.level", "ERROR")
          .getOrCreate()
      )
      
      
      def compare_sums(data, num_partitions):
          # Sum with all rows in a single partition.
          df = spark.createDataFrame(data, "val double").coalesce(1)
          result1 = df.agg(sum(col("val"))).collect()[0][0]
          # Sum the same rows after spreading them across num_partitions partitions.
          df = spark.createDataFrame(data, "val double").repartition(num_partitions)
          result2 = df.agg(sum(col("val"))).collect()[0][0]
          assert result1 == result2, f"{result1}, {result2}"
      
      
      if __name__ == "__main__":
          # compare_sums returns nothing; it raises AssertionError on a mismatch.
          compare_sums(SUM_EXAMPLE, 2)
      

      This fails as follows:

      AssertionError: 9007199254740994.0, 9007199254740992.0
      

      I suspected some kind of problem related to code generation, so I tried setting all of these to false:

      • spark.sql.codegen.wholeStage
      • spark.sql.codegen.aggregate.map.twolevel.enabled
      • spark.sql.codegen.aggregate.splitAggregateFunc.enabled

      But this did not change the behavior.
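
For reference, a minimal sketch of one way the three settings above could be applied when building the session. The helper name `with_codegen_disabled` and the `CODEGEN_FLAGS` dict are hypothetical, introduced only for illustration; the configuration keys are the ones listed above, and `builder.config(key, value)` is the standard `SparkSession.builder` call already used in the reproduction.

```python
# Configuration keys taken from the list above, each set to "false".
CODEGEN_FLAGS = {
    "spark.sql.codegen.wholeStage": "false",
    "spark.sql.codegen.aggregate.map.twolevel.enabled": "false",
    "spark.sql.codegen.aggregate.splitAggregateFunc.enabled": "false",
}


def with_codegen_disabled(builder):
    """Apply every flag in CODEGEN_FLAGS to a SparkSession builder.

    Hypothetical helper: `builder` is expected to behave like
    SparkSession.builder, whose config(key, value) returns a builder.
    """
    for key, value in CODEGEN_FLAGS.items():
        builder = builder.config(key, value)
    return builder
```

Assuming PySpark is installed, this would be used as `spark = with_codegen_disabled(SparkSession.builder).getOrCreate()` before running the reproduction.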

      Somehow, the partitioning of the data affects the computed sum.
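
One plausible explanation, consistent with the "Not A Problem" resolution, is that floating-point addition is not associative, so the order in which partial sums are combined can change the result. The sketch below reproduces the same two values in plain Python, without Spark. The split of rows into partitions shown here is an assumption chosen for illustration; the actual row placement produced by `repartition` is not stated in this report.

```python
import math

TWO_53 = 9007199254740992.0  # 2**53; above this, doubles are spaced 2 apart


def sequential_sum(values):
    """Add values left to right, as a single-partition aggregate might."""
    total = 0.0
    for v in values:
        total += v
    return total


def partitioned_sum(partitions):
    """Sum each partition on its own, then add the partial sums together."""
    return sequential_sum([sequential_sum(p) for p in partitions])


data = [1.0, 0.0, 1.0, TWO_53]

# One partition: 1 + 0 + 1 = 2 is exact, and 2 + 2**53 is representable.
single = sequential_sum(data)

# Assumed two-way split: 1 + 2**53 is not representable and rounds back
# to 2**53 (ties to even), so each 1.0 is lost separately.
split = partitioned_sum([[1.0, TWO_53], [0.0, 1.0]])

print(single)           # 9007199254740994.0
print(split)            # 9007199254740992.0
print(math.fsum(data))  # 9007199254740994.0 (correctly rounded sum)
```

Under this reading, both results are valid outcomes of double-precision arithmetic, and a comparison with a tolerance (for example `math.isclose`) is a more robust check than strict equality.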

People

    • Assignee: Unassigned
    • Reporter: Nicholas Chammas (nchammas)