Spark / SPARK-31287

groupby().applyInPandas, groupby().cogroup().applyInPandas and mapInPandas should ignore type hints

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark, SQL
    • Labels: None

    Description

      Type hints on a function passed to the pandas function APIs should have no effect at this point. However, PySpark currently tries to infer the UDF type from the type hints whenever they are set, so the following call fails:

      import pandas as pd
      
      def pandas_plus_one(v: pd.DataFrame) -> pd.DataFrame:
          return v + 1
      
      spark.range(10).groupby('id').applyInPandas(pandas_plus_one, schema="id long").show()
      
      Traceback (most recent call last):
        File "/.../spark/python/pyspark/sql/utils.py", line 98, in deco
          return f(*a, **kw)
        File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling o34.flatMapGroupsInPandas.
      : java.lang.IllegalArgumentException: requirement failed: Must pass a grouped map udf
      	at scala.Predef$.require(Predef.scala:281)
      	at org.apache.spark.sql.RelationalGroupedDataset.flatMapGroupsInPandas(RelationalGroupedDataset.scala:541)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:498)
      	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
      	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
      	at py4j.Gateway.invoke(Gateway.java:282)
      	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
      	at py4j.commands.CallCommand.execute(CallCommand.java:79)
      	at py4j.GatewayConnection.run(GatewayConnection.java:238)
      	at java.lang.Thread.run(Thread.java:748)
      
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/.../spark/python/pyspark/sql/pandas/group_ops.py", line 182, in applyInPandas
          jdf = self._jgd.flatMapGroupsInPandas(udf_column._jc.expr())
        File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
        File "/.../spark/python/pyspark/sql/utils.py", line 102, in deco
          raise converted
      pyspark.sql.utils.IllegalArgumentException: requirement failed: Must pass a grouped map udf
      

      It looks like groupby().cogroup().applyInPandas and mapInPandas have the same issue.
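
      Since the failure is triggered only when type hints are set, one possible way to sidestep it on an affected build is to hand the API a copy of the function that carries no annotations. The sketch below uses only the standard library; `without_type_hints`, `plus_one` and `bare_plus_one` are hypothetical names introduced for illustration, and `plus_one` is an int-based stand-in for the pandas function above so that the block runs without Spark.

```python
import inspect
import types
import typing


def without_type_hints(func):
    """Return a copy of `func` that runs the same code but has no annotations.

    Hypothetical helper, not part of PySpark. The copy keeps the original
    parameter names and defaults, so its arity is unchanged.
    """
    clone = types.FunctionType(
        func.__code__,      # same compiled body
        func.__globals__,   # same module globals
        func.__name__,
        func.__defaults__,
        func.__closure__,
    )
    clone.__kwdefaults__ = func.__kwdefaults__
    clone.__doc__ = func.__doc__
    return clone


# Stand-in for pandas_plus_one; ints are used so no pandas import is needed.
def plus_one(v: int) -> int:
    return v + 1


bare_plus_one = without_type_hints(plus_one)

print(typing.get_type_hints(plus_one))       # hints are present on the original
print(typing.get_type_hints(bare_plus_one))  # the copy has none
print(inspect.signature(bare_plus_one))      # parameter names are preserved
print(bare_plus_one(1))                      # behaviour is unchanged
```

      On an affected build, the idea would be to pass without_type_hints(pandas_plus_one) to applyInPandas in place of the annotated function. This has not been checked against Spark here; simply defining the function without annotations achieves the same thing.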

            People

              Assignee: Hyukjin Kwon (gurwls223)
              Reporter: Hyukjin Kwon (gurwls223)