Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-32846

Support createDataFrame from an RDD of pd.DataFrames

    XMLWordPrintableJSON

Details

    • Patch

    Description

      Add support to createDataFrame from a distributed collection of pandas.DataFrames by converting the RDD of pd.DFs to an RDD of arrow records batches, then directly creating the spark DataFrame from it.

       

      Performance is significantly better (vectorized) than creating a spark DF by converting each df to a list of rows, similar to the improvement of SPARK-20791.

       

      Initial example & benchmark for older spark versions: https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5

       

      I'm currently working on a PR and will post it soon.

       

      Extends the work done in: 

      https://issues.apache.org/jira/browse/SPARK-20791 

      https://issues.apache.org/jira/browse/SPARK-23030 

       

      Attachments

        Activity

          People

            Unassigned Unassigned
            lsavion Linar Savion
            Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: