Details
- Type: New Feature
- Status: Resolved
- Priority: Minor
- Resolution: Incomplete
- Affects Version/s: 2.3.1
- Fix Version/s: None
Description
If the computations in a Project are heavy (e.g., UDFs), it is useful to push Sample nodes down below deterministic Projects, so the projection only runs on the sampled rows:
scala> spark.range(10).selectExpr("id + 3").sample(0.5).explain(true)

// without this proposal
== Analyzed Logical Plan ==
(id + 3): bigint
Sample 0.0, 0.5, false, 3370873312340343855
+- Project [(id#0L + cast(3 as bigint)) AS (id + 3)#2L]
   +- Range (0, 10, step=1, splits=Some(4))

== Optimized Logical Plan ==
Sample 0.0, 0.5, false, 3370873312340343855
+- Project [(id#0L + 3) AS (id + 3)#2L]
   +- Range (0, 10, step=1, splits=Some(4))

// with this proposal
== Optimized Logical Plan ==
Project [(id#0L + 3) AS (id + 3)#2L]
+- Sample 0.0, 0.5, false, -6519017078291024113
   +- Range (0, 10, step=1, splits=Some(4))
POC: https://github.com/apache/spark/compare/master...maropu:SamplePushdown
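The core of the proposed rule can be sketched on a toy logical-plan ADT. The case classes below are illustrative stand-ins, not Spark's actual `Project`/`Sample` operators, and the `deterministic` flag is a simplification of Spark's per-expression determinism check:

```scala
// Toy logical-plan nodes (illustrative only, not Spark's classes).
sealed trait Plan
case class Range(n: Long) extends Plan
case class Project(exprs: Seq[String], deterministic: Boolean, child: Plan) extends Plan
case class Sample(lower: Double, upper: Double, child: Plan) extends Plan

// Push a Sample below a deterministic Project, so heavy project
// expressions (e.g. UDFs) are only evaluated on the sampled rows.
// Non-deterministic projects must stay above the Sample, because
// reordering would change their results.
def pushDownSample(plan: Plan): Plan = plan match {
  case Sample(lo, up, Project(exprs, true, child)) =>
    Project(exprs, deterministic = true, Sample(lo, up, child))
  case other => other
}
```

The pushed-down plan mirrors the "with this proposal" optimized plan above: the `Sample` moves between the `Project` and its child, while a project containing a non-deterministic expression is left untouched.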