Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-29265

Window+collect_list causing single-task operation

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Closed
    • Minor
    • Resolution: Not A Bug
    • 2.4.0
    • None
    • Spark Core
    • None
    • Any

    Description

      Hi,

       

      I had this problem in "real" environments and also made a self-contained test ( [^Test.scala] attached).

      Having this Window definition:

      val myWindow = Window.partitionBy($"word").orderBy("word") 
      val filt2 = filtrador.withColumn("avg_Time", collect_list($"number").over(myWindow))
      

        

      In the test I can see how all elements of my DF are being collected in a single task.

      Unbounded+unordered Window + collect_list seems to be collecting ALL the dataframe in a single executor/task.

      groupBy + collect_list seems to do it as expect (collect_list for each group independently).
       
      At the end it was a data distribution problem caused by one table which was supossed to do 1-1 join and did 1-N join (!!)

      Attachments

        Activity

          People

            Unassigned Unassigned
            FSainz Florentino Sainz
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: