Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-28140

Pyspark API to create spark.mllib RowMatrix from DataFrame

    XMLWordPrintableJSON

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: MLlib, PySpark
    • Labels:
      None

      Description

      Since many functions are only implemented in spark.mllib, it is often necessary to convert DataFrames of spark.ml vectors to spark.mllib distributed matrix formats. The first step, converting the spark.ml vectors to the spark.mllib equivalent, is straightforward. However, to the best of my knowledge it's not possible to convert the resulting DataFrame to a RowMatrix without using a python lambda function, which can have a significant performance hit. In my recent use case, SVD took 3.5m using the Scala API, but 12m using Python.

      To get around this performance hit, I propose adding a constructor to the Pyspark RowMatrix class that accepts a DataFrame with a single column of spark.mllib vectors. I'd be happy to add an equivalent API for IndexedRowMatrix if there is demand.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                hhd Henry Davidge
                Reporter:
                hhd Henry Davidge
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: