Spark / SPARK-27892

Saving/loading stages in PipelineModel should be parallel


    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: ML

      Description

      When a PipelineModel is saved or loaded, its stages are saved/loaded sequentially. For a PipelineModel with many stages, even though each stage's individual save/load takes under a second, the total time for the PipelineModel can run to several minutes. It should be trivial to parallelize the per-stage save/load in the SharedReadWrite object.

      To reproduce:

      // Run in spark-shell, where `spark` and its implicits are available.
      import org.apache.spark.ml._
      import org.apache.spark.ml.feature.VectorAssembler
      import spark.implicits._

      val outputPath = "..."
      // Build 100 trivial stages, each reading "input" and writing a distinct column.
      val stages = (1 to 100).map { i =>
        new VectorAssembler().setInputCols(Array("input")).setOutputCol("o" + i)
      }
      val p = new Pipeline().setStages(stages.toArray)
      val data = Seq(1, 1, 1).toDF("input")
      val pm = p.fit(data)
      // Each of the 100 stages is saved one at a time.
      pm.save(outputPath)
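      A minimal sketch of the parallelization idea, independent of Spark's actual SharedReadWrite internals: since each stage writes to its own subdirectory, the per-stage writes are independent and can be kicked off concurrently with scala.concurrent.Future. The `saveStage` helper below is a hypothetical stand-in for a stage's save, not Spark's real implementation.

```scala
import java.nio.file.{Files, Paths}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelStageSave {
  // Hypothetical stand-in for one stage's save: writes a small metadata file.
  def saveStage(dir: String, stageId: String): Unit =
    Files.write(
      Paths.get(dir, stageId + ".json"),
      s"""{"uid":"$stageId"}""".getBytes("UTF-8"))

  def main(args: Array[String]): Unit = {
    val dir      = Files.createTempDirectory("stages").toString
    val stageIds = (1 to 100).map("stage_" + _)

    // Launch all stage saves concurrently instead of a sequential foreach,
    // then wait for every write to finish before declaring the save complete.
    val futures = stageIds.map(id => Future(saveStage(dir, id)))
    Await.result(Future.sequence(futures), Duration.Inf)

    val written = stageIds.count(id => Files.exists(Paths.get(dir, id + ".json")))
    println(written)
  }
}
```

      The same pattern applies to loading: issue the per-stage reads as Futures and sequence them, preserving stage order in the resulting array.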

        Attachments

          Activity

            People

            • Assignee: Unassigned
            • Reporter: Jason Wang (memoryz)
            • Votes: 0
            • Watchers: 4
