Spark / SPARK-31671

Wrong error message in VectorAssembler when column lengths can not be inferred

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.4
    • Fix Version/s: 2.4.6, 3.0.0
    • Component/s: ML
    • Labels: None
    • Environment: macOS Catalina

    Description

      In VectorAssembler, when the input column lengths cannot be inferred and handleInvalid = "keep", a runtime exception is thrown with a message like the one below:

      Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [column1, column2].

      However, even after you set a vector size hint for column1, the message stays the same instead of listing [column2] only. This is inconsistent with what the error message itself says.

      This makes the exception hard to resolve, because I cannot tell which columns still require a VectorSizeHint. It is especially troublesome when there are many columns to deal with.

      Here is a simple example:

       

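      import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}
      import org.apache.spark.ml.linalg.Vectors
      // assumes a SparkSession named `spark`, as in spark-shell
      import spark.implicits._

      // create a DataFrame whose vector columns carry no size metadata
      val df = Seq(
        (Vectors.dense(1.0), Vectors.dense(2.0))
      ).toDF("n1", "n2")

      // set a vector size hint for the n1 column only
      val hintedDf = new VectorSizeHint()
        .setInputCol("n1")
        .setSize(1)
        .transform(df)

      // assemble n1 and n2; n2 still has no size metadata
      val output = new VectorAssembler()
        .setInputCols(Array("n1", "n2"))
        .setOutputCol("features")
        .setHandleInvalid("keep")
        .transform(hintedDf)

      // because only n1 has a vector size, the error message should ask for a size hint on n2 only
      output.show()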
      

      Expected error message:

       

      Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n2].
      

      Actual error message:

      Can not infer column lengths with handleInvalid = "keep". Consider using VectorSizeHint to add metadata for columns: [n1, n2].
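
Until the message is accurate, the columns that still lack a size can be found by inspecting the schema metadata directly. The following is a minimal sketch for spark-shell, assuming Spark's ML attribute classes are on the classpath; `columnsMissingSize` is a hypothetical helper name, not part of Spark. It relies on `AttributeGroup.size` being -1 when no size is recorded.

```scala
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.StructType

// Hypothetical helper (not part of Spark): returns the vector columns in `cols`
// whose schema metadata does not record a vector size.
def columnsMissingSize(schema: StructType, cols: Seq[String]): Seq[String] =
  cols.filter { c =>
    val field = schema(c)
    // Only vector-typed columns carry ML attribute-group metadata.
    field.dataType == SQLDataTypes.VectorType &&
      AttributeGroup.fromStructField(field).size == -1
  }
```

With the example above, `columnsMissingSize(hintedDf.schema, Seq("n1", "n2"))` should list only n2, which is the column that still needs a VectorSizeHint.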
      

      I changed one line in VectorAssembler.scala so that the message lists only the columns that still need a size hint, as expected.
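
The intended behavior can be modelled in plain Scala, without Spark. This is only a sketch under stated assumptions, not the actual patch: `keepModeErrorMessage` is a hypothetical name, and the input is a simplified list of (column name, known size) pairs where -1 stands for "size unknown". The point it illustrates is that the message should be built from the columns with unknown size rather than from all input columns.

```scala
// Simplified model of the check, not Spark's code.
// Each pair is (column name, known vector size); -1 means "unknown" (assumed convention).
def keepModeErrorMessage(sizes: Seq[(String, Int)]): Option[String] = {
  // Keep only the columns whose size is still unknown.
  val missing = sizes.collect { case (name, size) if size == -1 => name }
  if (missing.isEmpty) None
  else Some(
    "Can not infer column lengths with handleInvalid = \"keep\". " +
      "Consider using VectorSizeHint to add metadata for columns: " +
      missing.mkString("[", ", ", "]") + ".")
}
```

Calling it with `Seq("n1" -> 1, "n2" -> -1)` produces a message that names n2 only, matching the expected error message in this report.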

          People

            Assignee: fan31415 YijieFan
            Reporter: fan31415 YijieFan
            Votes: 0
            Watchers: 2
              Time Tracking

                Original Estimate: 72h
                Remaining Estimate: 72h
                Time Spent: Not Specified