Spark / SPARK-30961

Arrow enabled: to_pandas with date column fails


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: PySpark
    • Environment: Apache Spark 2.4.5

    Description

      Hi,

      There seems to be a bug in the Arrow-enabled toPandas conversion from a Spark DataFrame to a pandas DataFrame when the DataFrame has a column of type DateType. Here is a minimal example to reproduce the issue:

      from pyspark.sql import SparkSession
      from pyspark.sql import functions as F

      spark = SparkSession.builder.getOrCreate()
      is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
      print("Arrow optimization is enabled: " + is_arrow_enabled)
      # one-row DataFrame whose created_at column is converted to DateType
      spark_df = spark.createDataFrame(
          [['2019-12-06']], 'created_at: string') \
          .withColumn('created_at', F.to_date('created_at'))

      # works
      spark_df.toPandas()

      # enable the Arrow-based conversion and retry
      spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
      is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
      print("Arrow optimization is enabled: " + is_arrow_enabled)
      # raises AttributeError: Can only use .dt accessor with datetimelike values
      # series is still of type object, .dt does not exist
      spark_df.toPandas()
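
The error can be illustrated without a Spark session. The following is a minimal pandas-only sketch, assuming the Arrow path hands back an object-dtype Series of `datetime.date` values (the report only states that the series is of type object); the variable names are illustrative and not taken from PySpark.

```python
import datetime

import pandas as pd

# Assumption: an object-dtype Series of datetime.date values stands in for
# what the Arrow-enabled conversion produces for a DateType column.
series = pd.Series([datetime.date(2019, 12, 6)], dtype=object)

try:
    # The .dt accessor is only defined for datetimelike dtypes,
    # so accessing it on an object-dtype Series raises AttributeError.
    series.dt.date
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(series.dtype)
print(error_message)
```

If that assumption holds, any code that calls `.dt` directly on the returned series would fail in the same way as the traceback quoted above.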

      A fix would be to modify the _check_series_convert_date function in pyspark.sql.types to:

      def _check_series_convert_date(series, data_type):
          """
          Cast the series to datetime.date if it's a date type, otherwise returns the original series.

          :param series: pandas.Series
          :param data_type: a Spark data type for the series
          """
          from pyspark.sql.utils import require_minimum_pandas_version
          require_minimum_pandas_version()

          from pandas import to_datetime
          if type(data_type) == DateType:
              return to_datetime(series).dt.date
          else:
              return series
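
The effect of the proposed change can be checked in isolation with plain pandas. This is a hedged sketch, not the PySpark implementation: `convert_date_series` and its boolean `is_date_type` flag are hypothetical stand-ins for `_check_series_convert_date` and the `DateType` check, so that the example runs without Spark.

```python
import datetime

import pandas as pd


def convert_date_series(series, is_date_type):
    """Hypothetical stand-in for the proposed _check_series_convert_date.

    is_date_type replaces the `type(data_type) == DateType` check so the
    sketch does not need pyspark.
    """
    if is_date_type:
        # to_datetime first normalizes the values to datetime64,
        # which makes the .dt accessor available.
        return pd.to_datetime(series).dt.date
    return series


# Object-dtype input (the failing case described in the report).
object_series = pd.Series([datetime.date(2019, 12, 6)], dtype=object)
# datetime64 input (already datetimelike).
datetime_series = pd.Series(pd.to_datetime(["2019-12-06"]))

converted_from_object = convert_date_series(object_series, True)
converted_from_datetime = convert_date_series(datetime_series, True)

print(converted_from_object.iloc[0])
print(converted_from_datetime.iloc[0])
```

Routing the series through `to_datetime` before using `.dt` means the conversion no longer depends on whether the incoming series is already datetimelike, which appears to be the point of the suggested fix.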
      

      Let me know if I should prepare a Pull Request for the 2.4.5 branch.

      I have not tested the behavior on the master branch.

       

      Thanks,

      Nicolas

          People

            Assignee: Unassigned
            Reporter: Nicolas Renkamp (nicornk)
            Votes: 0
            Watchers: 3