Details
Bug
Status: Closed
Minor
Resolution: Not A Bug
2.4.0
None
None
Any
Description
Hi,
I ran into this problem in "real" environments and also wrote a self-contained test ([^Test.scala] attached).
Given this Window definition:
val myWindow = Window.partitionBy($"word").orderBy("word")
val filt2 = filtrador.withColumn("avg_Time", collect_list($"number").over(myWindow))
In the test I can see that all elements of my DataFrame are collected in a single task.
An unbounded, unordered Window combined with collect_list seems to collect the ENTIRE DataFrame in a single executor/task.
groupBy combined with collect_list seems to behave as expected (collect_list is computed for each group independently).
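For anyone comparing the two forms, here is a minimal sketch that runs both on a tiny dataset and returns the list size per word. The sample rows, the object name `WindowVsGroupBy`, and the helper `listSizes` are made up for illustration and are not taken from the attached test. It only compares results and does not reproduce the task distribution seen on the real data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, size}

object WindowVsGroupBy {
  // Returns (word -> list size) for the Window variant and for the groupBy variant.
  // The data below is hypothetical sample data, not the reporter's dataset.
  def listSizes(spark: SparkSession): (Map[String, Int], Map[String, Int]) = {
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5), ("c", 6))
      .toDF("word", "number")

    // Same Window definition as in the report: partition and order by the same column.
    val myWindow = Window.partitionBy($"word").orderBy("word")

    // Window variant: every input row is kept and carries the list for its partition.
    val windowed = df.withColumn("numbers", collect_list($"number").over(myWindow))

    // groupBy variant: one output row per word.
    val grouped = df.groupBy($"word").agg(collect_list($"number").as("numbers"))

    val windowSizes = windowed
      .select($"word", size($"numbers").as("n"))
      .distinct()
      .as[(String, Int)]
      .collect()
      .toMap

    val groupSizes = grouped
      .select($"word", size($"numbers").as("n"))
      .as[(String, Int)]
      .collect()
      .toMap

    (windowSizes, groupSizes)
  }
}
```

If both maps agree, the Window is partitioning per word, and a single overloaded task more likely points to the shape of the data than to collect_list itself.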
In the end, this turned out to be a data distribution problem: a join against one table that was supposed to be 1-1 was actually 1-N (!!)
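One way to catch this kind of issue early is to check that the join key is unique on the side that is supposed to be 1-1. The sketch below is only an assumption about how such a check could look. The object name `JoinKeyCheck`, the helper `duplicatedKeys`, and the column names in the usage note are hypothetical.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object JoinKeyCheck {
  // Returns the key values that occur more than once in `df`,
  // together with their occurrence count in a column named "count".
  // An empty result means `key` is unique, so a join on it cannot fan out from this side.
  def duplicatedKeys(df: DataFrame, key: String): DataFrame =
    df.groupBy(key).count().filter(col("count") > 1)
}
```

For example, calling `JoinKeyCheck.duplicatedKeys(lookupTable, "word").show()` before the join (where `lookupTable` stands in for the table expected to be 1-1) lists any keys that would turn the join into 1-N.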