  1. Spark
  2. SPARK-30939

StringIndexer setOutputCols does not set output cols


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: ML
    • Labels: None

    Description

      (Credit to Brooke Wenig for finding it). Quoting:

      ".. The python code works completely fine, but the scala code is outputting

      strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output, strIdx_8278ae6d55b3__output
      

      for the output of the string indexer, instead of using the column names specified in here:

      val stringIndexer = new StringIndexer()
        .setInputCols(categoricalCols)
        .setOutputCols(indexOutputCols)
        .setHandleInvalid("skip")
      

      I was expecting the resulting column names to be

      indexOutputCols: Array[String] = Array(host_is_superhostIndex, cancellation_policyIndex, instant_bookableIndex, neighbourhood_cleansedIndex, property_typeIndex, room_typeIndex, bed_typeIndex)
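
      One way to check the expectation above is to compare the requested output names with what the fitted model reports. The following is only a sketch: the object name `StringIndexerOutputColsCheck`, the helper `missingOutputCols`, and the two-column sample data are hypothetical stand-ins for the reporter's dataset, and it assumes a Spark 3.x build with a local `SparkSession`.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object StringIndexerOutputColsCheck {
  // Returns the requested output names that are missing from either the
  // schema reported by transformSchema or the transformed DataFrame.
  def missingOutputCols(spark: SparkSession): Seq[String] = {
    import spark.implicits._

    // Hypothetical sample data; the original report used a larger dataset.
    val df = Seq(
      ("Entire home/apt", "Real Bed"),
      ("Private room", "Futon"),
      ("Entire home/apt", "Real Bed")
    ).toDF("room_type", "bed_type")

    val categoricalCols = Array("room_type", "bed_type")
    val indexOutputCols = categoricalCols.map(_ + "Index")

    val stringIndexer = new StringIndexer()
      .setInputCols(categoricalCols)
      .setOutputCols(indexOutputCols)
      .setHandleInvalid("skip")

    val model = stringIndexer.fit(df)
    val fromSchema = model.transformSchema(df.schema).fieldNames
    val fromData = model.transform(df).columns

    indexOutputCols
      .filterNot(c => fromSchema.contains(c) && fromData.contains(c))
      .toSeq
  }
}
```

      On a build that honors `setOutputCols`, the returned sequence should be empty.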
      

      Indeed I'm pretty sure this is the bug:

        private def validateAndTransformField(
            schema: StructType,
            inputColName: String,
            outputColName: String): StructField = {
          val inputDataType = schema(inputColName).dataType
          require(inputDataType == StringType || inputDataType.isInstanceOf[NumericType],
            s"The input column $inputColName must be either string type or numeric type, " +
              s"but got $inputDataType.")
          require(schema.fields.forall(_.name != outputColName),
            s"Output column $outputColName already exists.")
          NominalAttribute.defaultAttr.withName($(outputCol)).toStructField()
        }
      

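      The last line names the new field with `$(outputCol)`, the default single-column output parameter, instead of the `outputColName` argument passed to the method. As a result, every output field gets the same default name, regardless of what was passed to `setOutputCols`.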
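
      A minimal sketch of what a corrected version of the method might look like, assuming the intent is to name each field after its own `outputColName`. The wrapper object `StringIndexerFieldSketch` is hypothetical, and the method is written as a standalone function rather than as a private member of `StringIndexer`. It is not presented as the actual patch.

```scala
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.types.{NumericType, StringType, StructField, StructType}

object StringIndexerFieldSketch {
  // Same validation as the quoted method; only the last line differs.
  def validateAndTransformField(
      schema: StructType,
      inputColName: String,
      outputColName: String): StructField = {
    val inputDataType = schema(inputColName).dataType
    require(inputDataType == StringType || inputDataType.isInstanceOf[NumericType],
      s"The input column $inputColName must be either string type or numeric type, " +
        s"but got $inputDataType.")
    require(schema.fields.forall(_.name != outputColName),
      s"Output column $outputColName already exists.")
    // Use the per-column name passed in, not the single outputCol parameter.
    NominalAttribute.defaultAttr.withName(outputColName).toStructField()
  }
}
```

      Called once per input/output pair, this yields one field per requested name, for example `room_typeIndex` for the input column `room_type`.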

            People

              Assignee: Sean R. Owen (srowen)
              Reporter: Sean R. Owen (srowen)
