SPARK-31500: collect_set() of BinaryType returns duplicate elements


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5
    • Fix Version/s: 2.4.6
    • Component/s: SQL

    Description

      The collect_set() aggregate function should produce a set of distinct elements, but when the column argument's type is BinaryType it returns duplicates.
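A plausible JVM-level explanation (an assumption on my part, not stated in this report): Scala's Array[Byte] inherits Java's reference-based equals/hashCode, so a set built directly on the array objects does not de-duplicate by content. A minimal plain-Scala sketch, no Spark required:

```scala
// Two byte arrays with identical contents are still distinct objects on the JVM.
val a = "cat".getBytes
val b = "cat".getBytes
println(a == b)                     // false: arrays use reference equality
println(a.sameElements(b))          // true: the contents match
println(Set(a, b).size)             // 2: a Set keeps both "duplicate" arrays
println(Set(a.toSeq, b.toSeq).size) // 1: Seq wrappers compare by content
```

Wrapping the arrays (here with toSeq) restores content-based equality, which is the kind of normalization a fix would need to apply internally.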


      Example:

      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.expressions.Window
      import spark.implicits._ // pre-imported in spark-shell; needed elsewhere for .toDF() and the 'col syntax

      case class R(id: String, value: String, bytes: Array[Byte])
      def makeR(id: String, value: String) = R(id, value, value.getBytes)
      val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF()


      // In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected).

      df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "bytesSet").show(truncate=false)


      // The same problem appears when using window functions.
      val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
      df.select(
          collect_set('value).over(win) as "stringSet",
          collect_set('bytes).over(win) as "bytesSet"
        )
        .select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize")
        .show()
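Until running a fixed version (2.4.6 or later), one possible workaround (a sketch of my own, not taken from this report) is to aggregate over a content-comparable encoding of the bytes and decode afterwards. base64/unbase64 and the transform higher-order function used below are standard Spark SQL functions (transform requires Spark 2.4+):

```scala
// Workaround sketch: collect_set over base64 strings (compared by content),
// then decode each distinct string back to binary.
val distinctBytes = df
  .agg(collect_set(base64('bytes)) as "b64Set")
  .select(expr("transform(b64Set, s -> unbase64(s))") as "bytesSet")
distinctBytes.show(truncate = false)
```

This sidesteps the bug because the deduplication happens on StringType values, whose equality is content-based.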


          People

            Assignee: planga82 Pablo Langa Blanco
            Reporter: ewasserman Eric Wasserman
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved: