Description
Execution plans do not differ between passing an empty and a non-empty DataFrame to PySpark's subtract call.
df.subtract(regDf)
yields the same physical plan as:
df.subtract(emptyDf)
Since the operation (EXCEPT DISTINCT in Spark SQL) requires a sort on both DataFrames, special-casing an empty input could yield significant speed-ups: if the incoming DataFrame is empty, no processing should be needed.
This should be a quick fix for a seasoned committer.
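Until the optimizer handles this case, a user-side workaround might look like the sketch below. It is not the proposed Spark change: `subtract_skipping_empty` is a hypothetical helper name, and it assumes both arguments are PySpark DataFrames with compatible schemas. It relies only on `DataFrame.take`, `DataFrame.distinct`, and `DataFrame.subtract`.

```python
def subtract_skipping_empty(df, other):
    """Hypothetical helper: df.subtract(other) that skips the subtract plan
    when `other` turns out to be empty.

    Both arguments are assumed to be pyspark.sql.DataFrame objects.
    """
    # take(1) returns a list with at most one Row. It still triggers a Spark
    # job, so the check only pays off when it is cheaper than the subtract.
    if len(other.take(1)) == 0:
        # subtract() has EXCEPT DISTINCT semantics, so duplicates in `df`
        # must still be removed to return the same rows.
        return df.distinct()
    return df.subtract(other)
```

Calling `subtract_skipping_empty(df, emptyDf)` then plans only a distinct on `df`. Note that the empty branch never calls `subtract`, so a schema mismatch between the two DataFrames would go unreported in that case.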