Spark / SPARK-25124

VectorSizeHint.size is buggy, breaking streaming pipeline

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.3.2, 2.4.0
    • Component/s: ML

    Description

      Currently, when VectorSizeHint().setSize(3) is used in an ML pipeline, transforming a stream fails with a nondescript exception saying that the stream has not been started. The root cause is a pair of bugs in the Python API: setSize and getSize do not return their values and return None instead:

      https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3846
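
The failure mode described above can be sketched in plain Python. This is only an illustration under stated assumptions: `BuggyHint` and `FixedHint` are hypothetical stand-ins, not pyspark classes, and the sketch assumes the accessors simply lack a `return` statement, which is what "do not return values but None" suggests.

```python
# Hypothetical stand-ins for illustration only; these are not pyspark classes.
class BuggyHint:
    """Mimics the reported behavior: the accessors lack `return`."""

    def __init__(self):
        self._size = None

    def setSize(self, value):
        # State is updated in place, but nothing is returned,
        # so the call itself evaluates to None.
        self._size = value

    def getSize(self):
        # The value is evaluated and discarded; the method returns None.
        self._size


class FixedHint(BuggyHint):
    """Same class with the missing `return` statements added."""

    def setSize(self, value):
        self._size = value
        return self  # returning self keeps chained calls usable

    def getSize(self):
        return self._size


buggy = BuggyHint().setSize(3)  # chained call binds the name to None
fixed = FixedHint().setSize(3)  # chained call binds the name to the object

print(buggy)            # None
print(fixed.getSize())  # 3
```

Under this assumption, a chained call such as `VectorSizeHint(...).setSize(3)` would leave the variable bound to None rather than to the transformer, so the error only surfaces later, when the pipeline tries to use that stage.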

      To reproduce, using the example from the documentation:

      # Assumes an active SparkSession bound to `spark`, as in the pyspark shell.
      from pyspark.ml.linalg import Vectors
      from pyspark.ml import Pipeline, PipelineModel
      from pyspark.ml.feature import VectorAssembler, VectorSizeHint
      data = [(Vectors.dense([1., 2., 3.]), 4.)]
      df = spark.createDataFrame(data, ["vector", "float"])
      # Will fail: setSize returns None, so sizeHint is None instead of the transformer.
      sizeHint = VectorSizeHint(inputCol="vector", handleInvalid="skip").setSize(3)
      vecAssembler = VectorAssembler(inputCols=["vector", "float"], outputCol="assembled")
      pipeline = Pipeline(stages=[sizeHint, vecAssembler])
      pipelineModel = pipeline.fit(df)
      pipelineModel.transform(df).head().assembled
      

          People

            Assignee: Huaxin Gao (huaxingao)
            Reporter: Timothy Hunter (timhunter)
            Shepherd: Joseph K. Bradley