Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version/s: 2.1.0
- Fix Version/s: None
Description
Currently, while optimizing a query plan, Spark pushes filters down the query plan tree, so that

    Join Inner, (a = b)
    :- Filter UDF(a)
    :  +- Relation A
    +- Relation B

becomes

    Join Inner, (a = b)
    :- Filter UDF(a)
    :  +- Relation A
    +- Filter UDF(b)
       +- Relation B
In general, pushing filters down is a good thing, since it reduces the number of records that go through the join.
However, when the filter is a user-defined function (UDF), the optimizer cannot know whether the cost of executing the function twice will be higher than the cost of joining more records.
So I think the optimizer should not move a user-defined function within the query plan tree. Users who want the filter applied on both sides can still duplicate the function themselves.
See this question on Stack Overflow: https://stackoverflow.com/questions/44291078/how-to-tune-the-query-planner-and-turn-off-an-optimization-in-spark