  Spark / SPARK-11758

Missing Index column while creating a DataFrame from Pandas

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 1.5.1
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Environment: Linux Debian, PySpark, in local testing.

    • Patch

    Description

      In PySpark's SQLContext, when createDataFrame() is called with a pandas.DataFrame and a 'schema' made of StructFields, the function _createFromLocal() converts the pandas.DataFrame but ignores two things:

      • The index column, because to_records() is called with the flag index=False.
      • Timestamp records, because a date column cannot be kept as the index and pandas does not convert its records to the Timestamp type.
        As a result, converting a DataFrame from pandas to Spark SQL works poorly in scenarios with temporal records.

      Doc: http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.to_records.html
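
To make the first point concrete, here is a minimal sketch that uses pandas only (no Spark). The frame `pdf` and its index name `id` are made up for illustration; the list comprehension mirrors the line marked HERE in the affected code.

```python
import pandas as pd

# Hypothetical frame: "id" is a named index, "value" is an ordinary column.
pdf = pd.DataFrame({"value": [1.0, 2.0]}, index=pd.Index([10, 20], name="id"))

# Same conversion as in _createFromLocal(): the index is left out of each record.
without_index = [r.tolist() for r in pdf.to_records(index=False)]

# With index=True the index becomes the first field of every record.
with_index = [r.tolist() for r in pdf.to_records(index=True)]

print(without_index)
print(with_index)
```

Comparing the two lists shows that the `id` values only survive when `index=True` is passed, which suggests the index is lost before Spark ever sees the data.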

      Affected code:

      def _createFromLocal(self, data, schema):
          """
          Create an RDD for DataFrame from an list or pandas.DataFrame, returns
          the RDD and schema.
          """
          if has_pandas and isinstance(data, pandas.DataFrame):
              if schema is None:
                  schema = [str(x) for x in data.columns]
              data = [r.tolist() for r in data.to_records(index=False)]  # HERE

          # ...
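
A possible workaround on the caller's side, sketched here under assumptions and not taken from the issue itself: move the index into a regular column with `reset_index()` and convert timestamps to plain `datetime` objects before handing the rows to Spark. The names `pdf`, `ts`, `value`, `flat`, and `rows` are hypothetical, and the Spark call is left as a comment because it needs a live SQLContext.

```python
from datetime import datetime

import pandas as pd

# Hypothetical frame whose index holds the timestamps.
pdf = pd.DataFrame(
    {"value": [1.5, 2.5]},
    index=pd.DatetimeIndex([datetime(2015, 11, 16), datetime(2015, 11, 17)], name="ts"),
)

# reset_index() turns the index into an ordinary column named "ts",
# so a later to_records(index=False) would no longer drop it.
flat = pdf.reset_index()
columns = list(flat.columns)

# Build plain Python rows; pandas Timestamps become datetime.datetime objects.
rows = [(ts.to_pydatetime(), float(v)) for ts, v in zip(flat["ts"], flat["value"])]

# Assumption: with a running SQLContext and a matching StructType `schema`,
# the rows could then be passed on, e.g.:
# df = sqlContext.createDataFrame(rows, schema)
```

This keeps the conversion explicit in user code, at the cost of bypassing the pandas path in createDataFrame() entirely.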

            People

              Assignee: Unassigned
              Reporter: Leandro Ferrado (leferrad)
              Votes: 0
              Watchers: 2

                Time Tracking

                  Original Estimate: 5h
                  Remaining Estimate: 5h
                  Time Spent: Not Specified