SPARK-31500: collect_set() of BinaryType returns duplicate elements


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5
    • Fix Version/s: 2.4.6
    • Component/s: SQL

    Description

      The collect_set() aggregate function should produce a set of distinct elements, but when the column argument's type is BinaryType it returns duplicates.
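A plausible JVM-level explanation (an assumption on my part, not stated in this report): Scala's Array[Byte] inherits Java's reference-based equals/hashCode, so a set built directly on the array objects does not de-duplicate by content. A minimal plain-Scala sketch, no Spark required:

```scala
// Two byte arrays with identical contents are still distinct objects on the JVM.
val a = "cat".getBytes
val b = "cat".getBytes
println(a == b)                     // false: arrays use reference equality
println(a.sameElements(b))          // true: the contents match
println(Set(a, b).size)             // 2: a Set keeps both "duplicate" arrays
println(Set(a.toSeq, b.toSeq).size) // 1: Seq wrappers compare by content
```

Wrapping the arrays (here with toSeq) restores content-based equality, which is the kind of normalization a fix would need to apply internally.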


      Example:

      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.expressions.Window
      import spark.implicits._ // pre-imported in spark-shell; needed elsewhere for .toDF() and the 'col syntax

      case class R(id: String, value: String, bytes: Array[Byte])
      def makeR(id: String, value: String) = R(id, value, value.getBytes)
      val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF()


      // In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected).

      df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "bytesSet").show(truncate=false)


      // The same problem appears when using window functions.
      val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
      df.select(
          collect_set('value).over(win) as "stringSet",
          collect_set('bytes).over(win) as "bytesSet"
        )
        .select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize")
        .show()
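Until running a fixed version (2.4.6 or later), one possible workaround (a sketch of my own, not taken from this report) is to aggregate over a content-comparable encoding of the bytes and decode afterwards. base64/unbase64 and the transform higher-order function used below are standard Spark SQL functions (transform requires Spark 2.4+):

```scala
// Workaround sketch: collect_set over base64 strings (compared by content),
// then decode each distinct string back to binary.
val distinctBytes = df
  .agg(collect_set(base64('bytes)) as "b64Set")
  .select(expr("transform(b64Set, s -> unbase64(s))") as "bytesSet")
distinctBytes.show(truncate = false)
```

This sidesteps the bug because the deduplication happens on StringType values, whose equality is content-based.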


          People

            Assignee: planga82 Pablo Langa Blanco
            Reporter: ewasserman Eric Wasserman
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved: