Spark / SPARK-25147

GroupedData.apply pandas_udf crashing


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.3.1
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Environment:

      OS: Mac OS 10.13.6

      Python: 2.7.15, 3.6.6

      PyArrow: 0.10.0

      Pandas: 0.23.4

      Numpy: 1.15.0
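Because the issue was resolved as Cannot Reproduce, an exact record of the environment matters. The following is a minimal sketch, not part of the original report, of one way to collect the versions listed above. The helper name `collect_versions` is hypothetical, and the sketch assumes it is run with the same interpreter the Python workers use (for example, the one `PYSPARK_PYTHON` points to).

```python
import importlib
import platform
import sys


def collect_versions(module_names):
    """Map each module name to its __version__, or to a placeholder string."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            # The package is missing from this interpreter's environment.
            versions[name] = "not installed"
            continue
        # Not every module exposes __version__, so fall back to "unknown".
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    print("OS:", platform.platform())
    print("Python:", sys.version.split()[0])
    packages = ["pyspark", "pyarrow", "pandas", "numpy"]
    for name, version in collect_versions(packages).items():
        print(name + ":", version)
```

Running it under both Python versions listed here would show whether the two interpreters see the same PyArrow and Pandas builds.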

    Description

      Running the following example, taken straight from the docs, fails with org.apache.spark.SparkException: Python worker exited unexpectedly (crashed). None of the logs I can see explain why:

      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F
      
      
      spark = (
          SparkSession
          .builder
          .appName("pandas_udf")
          .getOrCreate()
      )
      
      df = spark.createDataFrame(
          [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
          ("id", "v")
      )
      
      @F.pandas_udf("id long, v double", F.PandasUDFType.GROUPED_MAP)
      def normalize(pdf):
          v = pdf.v
          return pdf.assign(v=(v - v.mean()) / v.std())
      
      (
          df
          .groupby("id")
          .apply(normalize)
          .show()
      )
      

       See the attached output.log for the stack trace.
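For reference, the values the example should produce can be checked without Spark, Pandas, or PyArrow. The sketch below is an illustration rather than part of the original report: it redoes the per-group normalization with the standard library only, and the function name `normalize_groups` is hypothetical. It relies on `statistics.stdev` being the sample standard deviation (n - 1 in the denominator), which matches the default of `std()` on a Pandas Series.

```python
from collections import defaultdict
from statistics import mean, stdev

# Same rows as the DataFrame in the example above: (id, v).
rows = [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)]


def normalize_groups(rows):
    """Return (id, normalized v) pairs, normalizing v within each id group."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)

    result = []
    for key in sorted(groups):
        values = groups[key]
        center = mean(values)
        # Sample standard deviation; requires at least two values per group.
        spread = stdev(values)
        result.extend((key, (v - center) / spread) for v in values)
    return result


if __name__ == "__main__":
    for key, value in normalize_groups(rows):
        print(key, round(value, 4))
```

For this data the normalized values are approximately -0.7071 and 0.7071 for id 1, and -0.8321, -0.2774, and 1.1094 for id 2. If the Spark job returns anything other than these values, or crashes as described, the problem lies outside the arithmetic of the UDF itself.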

       

       


            People

              Assignee: Unassigned
              Reporter: Mike Sukmanowsky (msukmanowsky)