SPARK-22700: Bucketizer.transform incorrectly drops row containing NaN


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.0, 2.3.0
    • Fix Version/s: 2.2.2, 2.3.0
    • Component/s: ML
    • Labels: None

    Description

      import org.apache.spark.ml.feature._
      
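      // Column "a" is the Bucketizer's input column; column "b" is not used by the transform.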
      val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, Double.NaN))).toDF("a", "b")
      
      val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
      
      val bucketizer: Bucketizer = new Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
      
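      // With "skip", only rows whose value in the input column "a" is NaN should be dropped.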
      bucketizer.setHandleInvalid("skip")
      
      scala> df.show
      +---+---+
      |  a|  b|
      +---+---+
      |2.3|3.0|
      |NaN|3.0|
      |6.7|NaN|
      +---+---+
      
      scala> bucketizer.transform(df).show
      +---+---+---+
      |  a|  b| aa|
      +---+---+---+
      |2.3|3.0|0.0|
      +---+---+---+
      
      

      When handleInvalid is set to "skip", the last input row is incorrectly dropped even though column 'b' is not an input column. Only the row with NaN in the input column 'a' should be skipped; the row (6.7, NaN) should be kept and bucketized to aa = 1.0.
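
      As a minimal workaround sketch for the affected versions (assuming the same spark session and the df and bucketizer values from the reproduction above), NaN rows can be dropped explicitly, but only in the actual input column, before calling transform:

      // Hypothetical workaround: drop rows whose *input* column "a" is NaN/null,
      // leaving the unrelated NaN in column "b" untouched.
      val cleaned = df.na.drop(Seq("a"))
      bucketizer.transform(cleaned).show()
      // Expected output:
      // +---+---+---+
      // |  a|  b| aa|
      // +---+---+---+
      // |2.3|3.0|0.0|
      // |6.7|NaN|1.0|
      // +---+---+---+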


          People

            Assignee: Ruifeng Zheng (podongfeng)
            Reporter: Ruifeng Zheng (podongfeng)
            Votes: 0
            Watchers: 5
