Spark > SPARK-22216 Improving PySpark/Pandas interoperability > SPARK-24976

Allow None for Decimal type conversion (specific to PyArrow 0.9.0)


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.3.2, 2.4.0
    • Component/s: PySpark
    • Labels: None

    Description

      See https://jira.apache.org/jira/browse/ARROW-2432

      If we use Arrow 0.9.0, the test case (None as decimal) fails as below:

      Traceback (most recent call last):
        File "/.../spark/python/pyspark/sql/tests.py", line 4672, in test_vectorized_udf_null_decimal
          self.assertEquals(df.collect(), res.collect())
        File "/.../spark/python/pyspark/sql/dataframe.py", line 533, in collect
          sock_info = self._jdf.collectToPython()
        File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
          answer, self.gateway_client, self.target_id, self.name)
        File "/.../spark/python/pyspark/sql/utils.py", line 63, in deco
          return f(*a, **kw)
        File "/.../spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
          format(target_id, ".", name), value)
      Py4JJavaError: An error occurred while calling o51.collectToPython.
      : org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 7, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
        File "/.../spark/python/pyspark/worker.py", line 320, in main
          process()
        File "/.../spark/python/pyspark/worker.py", line 315, in process
          serializer.dump_stream(func(split_index, iterator), outfile)
        File "/.../spark/python/pyspark/serializers.py", line 274, in dump_stream
          batch = _create_batch(series, self._timezone)
        File "/.../spark/python/pyspark/serializers.py", line 243, in _create_batch
          arrs = [create_array(s, t) for s, t in series]
        File "/.../spark/python/pyspark/serializers.py", line 241, in create_array
          return pa.Array.from_pandas(s, mask=mask, type=t)
        File "array.pxi", line 383, in pyarrow.lib.Array.from_pandas
        File "array.pxi", line 177, in pyarrow.lib.array
        File "error.pxi", line 77, in pyarrow.lib.check_status
        File "error.pxi", line 77, in pyarrow.lib.check_status
      ArrowInvalid: Error converting from Python objects to Decimal: Got Python object of type NoneType but can only handle these types: decimal.Decimal
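      The failing call is the `pa.Array.from_pandas(...)` conversion shown at the bottom of the Python-side traceback. As a rough standalone sketch (not Spark's actual patch), the same conversion can be reproduced outside Spark; on PyArrow 0.9.0 it raised the `ArrowInvalid` above per ARROW-2432, while later releases honor the null mask:

      ```python
      import decimal

      import pandas as pd
      import pyarrow as pa

      # Minimal reproduction of the conversion done in pyspark/serializers.py's
      # create_array(): turn a pandas Series of decimals (with None for nulls)
      # into an Arrow decimal array, passing a boolean null mask so Arrow knows
      # which slots to skip.
      s = pd.Series([decimal.Decimal("1.0"), None])
      mask = s.isnull()

      # On PyArrow 0.9.0 this raised ArrowInvalid (ARROW-2432) because the
      # decimal converter inspected the masked value anyway; on fixed releases
      # the masked position simply becomes a null in the resulting array.
      arr = pa.Array.from_pandas(s, mask=mask, type=pa.decimal128(38, 18))

      print(arr.null_count)  # the None entry becomes a single null
      ```

      The precision/scale `(38, 18)` here is illustrative (it matches Spark's default DecimalType), not something the traceback itself specifies.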
      


      People

        Assignee: Hyukjin Kwon (gurwls223)
        Reporter: Hyukjin Kwon (gurwls223)
        Votes: 0
        Watchers: 2
