Spark / SPARK-23929

pandas_udf schema mapped by position and not by name


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: PySpark
    • None
    • PySpark

      Spark 2.3.0

       

    Description

      The struct returned by a pandas_udf should be mapped to the provided schema by field name. Currently it is mapped by position instead, so the order of the fields in the schema changes the result.

      Consider these two examples, where the only change is the order of the fields in the provided schema struct:

      from pyspark.sql.functions import pandas_udf, PandasUDFType
      df = spark.createDataFrame(
          [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
          ("id", "v"))  
      # The schema lists v before id, the reverse of the DataFrame's column order.
      @pandas_udf("v double,id long", PandasUDFType.GROUPED_MAP)
      def normalize(pdf):
          v = pdf.v
          return pdf.assign(v=(v - v.mean()) / v.std())
      df.groupby("id").apply(normalize).show() 
      

      and this one:

      from pyspark.sql.functions import pandas_udf, PandasUDFType
      df = spark.createDataFrame(
          [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
          ("id", "v"))  
      # The schema lists id before v, matching the DataFrame's column order.
      @pandas_udf("id long,v double", PandasUDFType.GROUPED_MAP)
      def normalize(pdf):
          v = pdf.v
          return pdf.assign(v=(v - v.mean()) / v.std())
      df.groupby("id").apply(normalize).show()
      

      The results should be the same, apart from column order, but they differ:

      For the first snippet:

      +---+---+
      |  v| id|
      +---+---+
      |1.0|  0|
      |1.0|  0|
      |2.0|  0|
      |2.0|  0|
      |2.0|  1|
      +---+---+
      

      For the second snippet:

      +---+-------------------+
      | id|                  v|
      +---+-------------------+
      |  1|-0.7071067811865475|
      |  1| 0.7071067811865475|
      |  2|-0.8320502943378437|
      |  2|-0.2773500981126146|
      |  2| 1.1094003924504583|
      +---+-------------------+
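
      The first result is consistent with positional mapping: the id values land in the first schema field (v, as double) and the normalized v values land in the second (id, as long). The following plain-Python sketch simulates that under stated assumptions. It does not use Spark or pandas; normalize_group, map_by_position and map_by_name are hypothetical helpers written for illustration, and the truncating int() cast is an assumption that happens to reproduce the reported numbers, not a claim about how Spark converts values.

```python
from statistics import mean, stdev

# Rows of the example DataFrame, with columns in the order (id, v).
columns = ["id", "v"]
rows = [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)]


def normalize_group(group):
    """Stand-in for the normalize UDF: (v - mean) / sample standard deviation."""
    vs = [v for _, v in group]
    m, s = mean(vs), stdev(vs)
    return [(i, (v - m) / s) for i, v in group]


# Apply the stand-in per id group; the output keeps the (id, v) column order.
result = []
for key in sorted({i for i, _ in rows}):
    result.extend(normalize_group([r for r in rows if r[0] == key]))

# Declared schema from the first snippet: "v double,id long".
schema = [("v", float), ("id", int)]


def map_by_position(result_rows, schema):
    """Assumed current behaviour: n-th returned column is cast to the n-th schema field."""
    return [
        tuple(cast(value) for value, (_, cast) in zip(row, schema))
        for row in result_rows
    ]


def map_by_name(result_rows, result_columns, schema):
    """Expected behaviour: each schema field takes the returned column with the same name."""
    index = {name: pos for pos, name in enumerate(result_columns)}
    return [
        tuple(cast(row[index[name]]) for name, cast in schema)
        for row in result_rows
    ]


by_position = map_by_position(result, schema)
by_name = map_by_name(result, columns, schema)

print("by position (v, id):", by_position)
print("by name     (v, id):", by_name)
```

      In this sketch, by_position holds the same (v, id) pairs as the first table, while by_name holds the normalized v values next to the original ids, which is what the report expects.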
      
      
      

            People

              Assignee: Unassigned
              Reporter: omri374 (Omri)
              Votes: 0
              Watchers: 6
