Spark / SPARK-27245

Optimizer repeats Python UDF calls


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.1, 2.3.2, 2.4.0
    • Fix Version/s: None
    • Component/s: Optimizer, SQL

    Description

      The physical plan printed by the .explain() method shows that Python UDFs are called inefficiently in PySpark.

      This behaviour occurs under these circumstances:

      • PySpark API
      • At least one operation in the DAG that uses the result of the Python UDF

      My expectation is that the optimizer should call the Python UDF once with BatchEvalPython and then reuse the result in the following steps.

      Instead, the optimizer calls the same UDF n times, with the same parameters, within the same BatchEvalPython, and uses only one of the result columns (pythonUDF2#16) while discarding the others.

      I believe this could lead to poor performance, both because of the large data exchange with the Python processes and because of the additional calls.

      Example code:

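      from pyspark.sql import functions as f
      from pyspark.sql.types import IntegerType
      
      # Assumes an active SparkSession bound to `spark` (as in the pyspark shell).
      foo_udf = f.udf(lambda x: 1, IntegerType())
      
      # 'a' and 'b' only copy the 'result' column produced by the UDF.
      df = spark.createDataFrame([['bar']]) \
              .withColumn('result', foo_udf(f.col('_1'))) \
              .withColumn('a', f.col('result')) \
              .withColumn('b', f.col('result'))
      
      df.explain()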
      
      == Physical Plan ==
      *(1) Project [_1#0, pythonUDF2#16 AS result#2, pythonUDF2#16 AS a#5, pythonUDF2#16 AS b#9]
      +- BatchEvalPython [<lambda>(_1#0), <lambda>(_1#0), <lambda>(_1#0)], [_1#0, pythonUDF0#14, pythonUDF1#15, pythonUDF2#16]
         +- Scan ExistingRDD[_1#0]
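
      The duplication in the plan above can be made explicit with a small sketch that parses the plan text and compares the UDF columns produced by BatchEvalPython with those the Project node actually reads. The helper name `udf_usage` is hypothetical, and the regular expression assumes the `pythonUDF<n>#<id>` naming seen in this particular plan; other Spark versions may print the plan differently. It also assumes a single BatchEvalPython node and a single Project node.

```python
import re

# Plan text copied verbatim from the report above.
PLAN = """\
*(1) Project [_1#0, pythonUDF2#16 AS result#2, pythonUDF2#16 AS a#5, pythonUDF2#16 AS b#9]
+- BatchEvalPython [<lambda>(_1#0), <lambda>(_1#0), <lambda>(_1#0)], [_1#0, pythonUDF0#14, pythonUDF1#15, pythonUDF2#16]
   +- Scan ExistingRDD[_1#0]
"""

# Assumption: UDF output columns are named pythonUDF<n>#<id>, as in this plan.
UDF_COLUMN = re.compile(r"pythonUDF\d+#\d+")


def udf_usage(plan: str) -> dict:
    """Compare UDF columns produced by BatchEvalPython with those read by Project."""
    lines = plan.splitlines()
    batch_line = next(line for line in lines if "BatchEvalPython" in line)
    project_line = next(line for line in lines if "Project [" in line)

    produced = set(UDF_COLUMN.findall(batch_line))
    used = set(UDF_COLUMN.findall(project_line))
    return {
        "produced": sorted(produced),
        "used": sorted(used),
        "discarded": sorted(produced - used),
    }


summary = udf_usage(PLAN)
print(summary)
```

      For the plan in this report, the summary lists three produced columns, of which only pythonUDF2#16 is used; pythonUDF0#14 and pythonUDF1#15 are computed and then discarded.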
      

      Full code on Gist: https://gist.github.com/andrearota/f77b6a293421a3f26dd5d2fb0a04046e

       

    People

    • Assignee: Unassigned
    • Reporter: Andrea Rota (arota)