Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version/s: 2.1.0
- Fix Version/s: None
Description
Currently, while optimizing a query plan, Spark pushes filters down the query plan tree, so that

    Join Inner, (a = b)
    :- Filter UDF(a)
    :  +- Relation A
    +- Relation B

becomes

    Join Inner, (a = b)
    :- Filter UDF(a)
    :  +- Relation A
    +- Filter UDF(b)
       +- Relation B
In general, pushing filters down is a good thing, since it reduces the number of records that go through the join.
However, when the filter is a user-defined function (UDF), the optimizer cannot know whether the cost of executing the function twice will be higher than the cost of joining more records.
So I think the optimizer should not move a user-defined function within the query plan tree. Users who want the filter applied on both sides can still duplicate the function themselves.
See this question on Stack Overflow: https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark