Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-20939

Do not duplicate user-defined functions while optimizing logical query plans

    XMLWordPrintableJSON

Details

    Description

      Currently, while optimizing a query plan, spark pushes filters down the query plan tree, so that

      LogicalPlan
      Join Inner, (a = b)
          +- Filter UDF(a)
              +- Relation A
          +- Relation B
      

      becomes

      Optimized LogicalPlan
       Join Inner, (a = b)
           +- Filter UDF(a)
               +- Relation A
           +- Filter UDF(b)
               +- Relation B
      

      In general, it is a good thing to push down filters as it reduces the number of records that will go through the join.

      However, in the case where the filter is an user-defined function (UDF), we cannot know if the cost of executing the function twice will be higher than the eventual cost of joining more elements or not.

      So I think that the optimizer shouldn't move the user-defined function in the query plan tree. The user will still be able to duplicate the function if he wants to.

      See this question on stackoverflow: https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark

      Attachments

        Activity

          People

            Unassigned Unassigned
            lovasoa Lovasoa
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: