Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-28140

Pyspark API to create spark.mllib RowMatrix from DataFrame

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Minor
    • Resolution: Fixed
    • 3.0.0
    • 3.0.0
    • MLlib, PySpark
    • None

    Description

      Since many functions are only implemented in spark.mllib, it is often necessary to convert DataFrames of spark.ml vectors to spark.mllib distributed matrix formats. The first step, converting the spark.ml vectors to the spark.mllib equivalent, is straightforward. However, to the best of my knowledge it's not possible to convert the resulting DataFrame to a RowMatrix without using a python lambda function, which can have a significant performance hit. In my recent use case, SVD took 3.5m using the Scala API, but 12m using Python.

      To get around this performance hit, I propose adding a constructor to the Pyspark RowMatrix class that accepts a DataFrame with a single column of spark.mllib vectors. I'd be happy to add an equivalent API for IndexedRowMatrix if there is demand.

      Attachments

        Issue Links

          Activity

            People

              hhd Henry Davidge
              hhd Henry Davidge
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: