Spark / SPARK-28978

PySpark: Can't pass more than 256 arguments to a UDF


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.2, 2.4.0, 2.4.4
    • Fix Version/s: 3.0.0
    • Component/s: PySpark

    Description

      This code:

      https://github.com/apache/spark/blob/712874fa0937f0784f47740b127c3bab20da8569/python/pyspark/worker.py#L367-L379

      creates Python lambdas that call UDF functions by passing each argument individually rather than using varargs, for example: `lambda a: f(a[0], a[1], ...)`.

      This fails when there are more than 256 arguments.

      MLflow, when generating model predictions, passes one argument per feature column, and I have a model with more than 500 features.

      I was able to easily work around this by changing the generated lambdas to use varargs, as in `lambda a: f(*a)`.

      I don't know why these lambdas were created the way they were; using varargs is much simpler and works fine in my testing.
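      A minimal sketch of the two lambda styles described above, with a stand-in `f` in place of the real UDF (the names `f`, `n`, and `row` are illustrative, not from Spark's source):

      ```python
      # Hypothetical UDF: accepts any number of column values.
      def f(*cols):
          return sum(cols)

      n = 500  # more columns than older CPython allows as explicit call arguments

      # Style generated by worker.py: every argument expanded individually.
      # Source like "lambda a: f(a[0], a[1], ..., a[499])" fails to compile
      # on CPython versions that capped calls at 255 explicit arguments.
      expanded_src = "lambda a: f(" + ", ".join(f"a[{i}]" for i in range(n)) + ")"

      # The varargs workaround from this report: works for any number of columns.
      varargs = lambda a: f(*a)

      row = tuple(range(n))
      print(varargs(row))  # sum of 0..499 -> 124750
      ```

      The varargs form sidesteps the limit entirely because the interpreter unpacks the tuple at call time instead of compiling one bytecode argument per column.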



People

    Assignee: Bago Amirbekian (bago.amirbekian)
    Reporter: Jim Fulton (j1m)
    Votes: 0
    Watchers: 3
