Spark / SPARK-17913

Filter/join expressions can return incorrect results when comparing strings to longs

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.6.2, 2.0.0
    • Fix Version/s: 2.2.0
    • Component/s: SQL

    Description

      Reproducer:

        import org.apache.spark.SparkContext
        import org.apache.spark.sql.SQLContext

        case class E(subject: Long, predicate: String, objectNode: String)
      
        def test(sc: SparkContext) = {
          val sqlContext: SQLContext = new SQLContext(sc)
          import sqlContext.implicits._
      
          val broken = List(
            (19157170390056969L, "right", 19157170390056969L),
            (19157170390056973L, "wrong", 19157170390056971L),
            (19157190254313477L, "wrong", 19157190254313475L),
            (19157180859056133L, "wrong", 19157180859056131L),
            (19157170390056969L, "number", 161),
            (19157170390056971L, "string", "a string"),
            (19157190254313475L, "string", "another string"),
            (19157180859056131L, "number", 191)
          )
      
          val brokenDF = sc.parallelize(broken).map(b => E(b._1, b._2, b._3.toString)).toDF()
          val brokenFilter = brokenDF.filter($"subject" === $"objectNode")
          val fixed = brokenDF.filter(brokenDF("subject").cast("string") === brokenDF("objectNode"))
      
          // show() and explain() print directly and return Unit, so they are not wrapped in println
          println("***** incorrect filter results *****")
          brokenFilter.show()
          println("***** correct filter results *****")
          fixed.show()
      
          println("***** both sides cast to double *****")
          brokenFilter.explain()
        }
      
      Broken filter returns:
      
      +-----------------+---------+-----------------+
      |          subject|predicate|       objectNode|
      +-----------------+---------+-----------------+
      |19157170390056969|    right|19157170390056969|
      |19157170390056973|    wrong|19157170390056971|
      |19157190254313477|    wrong|19157190254313475|
      |19157180859056133|    wrong|19157180859056131|
      +-----------------+---------+-----------------+
      

      The physical plan shows that both sides of the expression are cast to Double before evaluation. Comparing a long to a numeric string therefore appears to work in many cases, but a Double has only a 53-bit significand, so distinct longs above 2^53 can round to the same Double. The values here are large enough and close enough together that this loss of precision produces incorrect matches.

      == Physical Plan ==
      Filter (cast(subject#0L as double) = cast(objectNode#2 as double))
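
The collapse can be seen without Spark. The following is a minimal sketch in plain Scala, using one of the "wrong" rows from the reproducer; the object and method names are hypothetical and only illustrate what the double cast in the plan does to these values.

```scala
object DoublePrecisionDemo {
  // Mirrors the plan above: both operands are converted to Double before comparing.
  def equalAsDouble(subject: Long, objectNode: String): Boolean =
    subject.toDouble == objectNode.toDouble

  // String-to-string comparison, with no numeric conversion involved.
  def equalAsString(subject: Long, objectNode: String): Boolean =
    subject.toString == objectNode

  def main(args: Array[String]): Unit = {
    val subject = 19157170390056973L
    val objectNode = "19157170390056971"

    // In this range adjacent Doubles are 4 apart, so both values round to ...972.
    println(subject.toDouble.toLong)            // 19157170390056972
    println(equalAsDouble(subject, objectNode)) // true  (the incorrect match)
    println(equalAsString(subject, objectNode)) // false (the values really differ)
  }
}
```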
      
      After casting the left side to a string, the filter returns the expected result:
      
      +-----------------+---------+-----------------+
      |          subject|predicate|       objectNode|
      +-----------------+---------+-----------------+
      |19157170390056969|    right|19157170390056969|
      +-----------------+---------+-----------------+
      

      The expected behavior in this case is probably to choose one side and cast the other (compare string to string or long to long) instead of converting both to a data type with less precision.
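
As a rough illustration of the two alternatives, the sketch below applies both comparisons to the reproducer's rows in plain Scala. The names are hypothetical, and treating an unparseable string as "no match" is an assumption made for this sketch rather than a statement about how Spark resolves the expression. On a DataFrame, the string-side variant is the workaround shown in the reproducer; a long-side variant would presumably cast the string column instead, e.g. `$"objectNode".cast("long")`.

```scala
import scala.util.Try

object SafeComparison {
  // Long-to-long: parse the string side; non-numeric text never matches (assumption).
  def equalAsLong(subject: Long, objectNode: String): Boolean =
    Try(objectNode.trim.toLong).toOption.contains(subject)

  // String-to-string: render the long side and compare the text exactly.
  def equalAsString(subject: Long, objectNode: String): Boolean =
    subject.toString == objectNode

  // Rows from the reproducer, with objectNode already converted to String.
  val rows: List[(Long, String, String)] = List(
    (19157170390056969L, "right", "19157170390056969"),
    (19157170390056973L, "wrong", "19157170390056971"),
    (19157190254313477L, "wrong", "19157190254313475"),
    (19157180859056133L, "wrong", "19157180859056131"),
    (19157170390056969L, "number", "161"),
    (19157170390056971L, "string", "a string"),
    (19157190254313475L, "string", "another string"),
    (19157180859056131L, "number", "191")
  )

  def main(args: Array[String]): Unit = {
    // Both strategies keep only the "right" row.
    rows.filter(r => equalAsLong(r._1, r._3)).foreach(println)
    rows.filter(r => equalAsString(r._1, r._3)).foreach(println)
  }
}
```

The two strategies are not interchangeable in general: a string such as "007" equals the long 7 under the long comparison but not under the string comparison, so the choice of side affects results for non-canonical numeric text.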

            People

              Assignee: Wenchen Fan (cloud_fan)
              Reporter: Ming Beckwith (mingbeckwith)