  1. Spark
  2. SPARK-45910

Numerical output of MulticlassClassificationEvaluator does not coincide with expected output

Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 3.4.1, 3.5.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      The following code shows MulticlassClassificationEvaluator producing numerical output that does not coincide with the expected output:

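      from pyspark.sql import SparkSession
      from pyspark.ml.classification import LinearSVC
      from pyspark.ml.feature import VectorAssembler
      from pyspark.ml.evaluation import MulticlassClassificationEvaluator
      
      # Reuses the active session (e.g. in the pyspark shell) or creates one.
      spark = SparkSession.builder.getOrCreate()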
      
      train_data = [(0, 1.0, 2.0, 3.0), (1, 4.0, 5.0, 6.0), (0, 7.0, 8.0, 9.0)]
      valid_data = [(1, 2.0, 3.0, 4.0), (0, 5.0, 6.0, 7.0), (1, 8.0, 9.0, 10.0)]
      
      schema = ["label", "feature1", "feature2", "feature3"]
      
      train = spark.createDataFrame(train_data, schema=schema)
      valid = spark.createDataFrame(valid_data, schema=schema)
      
      feature_columns = ["feature1", "feature2", "feature3"]
      assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
      train = assembler.transform(train)
      valid = assembler.transform(valid)
      
      svm = LinearSVC(maxIter=10, regParam=0.1)
      model = svm.fit(train)
      predictions = model.transform(valid)
      
      recallByLabel = MulticlassClassificationEvaluator(metricName="recallByLabel")
      weightedRecall = MulticlassClassificationEvaluator(metricName="weightedRecall")
      
      print(f"Recall by label: {recallByLabel.evaluate(predictions)}")
      print(f"Weighted recall: {weightedRecall.evaluate(predictions)}") 

      It produces:

      Recall by label: 1.0
      Weighted recall: 0.3333333333333333

      but predictions.show() implies the following hand calculated confusion matrix:

       -----------
      |  0  |  0  |
      |  2  |  1  |
       -----------

      where the recall (taking label 1 as the positive class) is 0, i.e., TP / (TP + FN) = 0 / (0 + 2).

      What is the nature of this discrepancy? Note also that it is not restricted to recall, and that other classifiers, which include a probability column in predictions, behave similarly.
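The arithmetic behind the two reported figures can be checked without Spark. The sketch below is only an illustration under stated assumptions: the labels are taken from valid_data, while the predictions (every row predicted as 0.0) are inferred from the hand-calculated matrix rather than copied from predictions.show(). The helper names recall_by_label and weighted_recall are hypothetical and are not part of the Spark API.

```python
from collections import Counter

# Labels come from valid_data in the report.
labels = [1.0, 0.0, 1.0]
# ASSUMPTION: every row is predicted as 0.0, as the hand-calculated
# confusion matrix suggests; not copied from predictions.show().
predictions = [0.0, 0.0, 0.0]


def recall_by_label(labels, predictions):
    """Return {label: recall}, i.e. correct predictions / rows with that label."""
    totals = Counter(labels)
    hits = Counter(l for l, p in zip(labels, predictions) if l == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}


def weighted_recall(labels, predictions):
    """Average the per-label recalls, weighted by each label's share of rows."""
    per_label = recall_by_label(labels, predictions)
    totals = Counter(labels)
    n = len(labels)
    return sum(per_label[label] * totals[label] / n for label in per_label)


per_label = recall_by_label(labels, predictions)
print(per_label)                             # {0.0: 1.0, 1.0: 0.0}
print(weighted_recall(labels, predictions))  # 0.3333333333333333
```

Under these assumptions, recall is 1.0 for label 0 and 0.0 for label 1, and the weighted value is 1/3, which matches both numbers printed above. One possible reading, not confirmed in this report, is that recallByLabel is being computed for label 0: MulticlassClassificationEvaluator has a metricLabel parameter whose default is 0.0. If that reading is right, constructing the evaluator with metricLabel=1.0 would be expected to report the recall for label 1 instead.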

       

      Furthermore, the translation of the example to Scala, namely:

      import org.apache.spark.ml.classification.LinearSVC
      import org.apache.spark.ml.feature.VectorAssembler
      import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
      import org.apache.spark.sql.DataFrame
      
      val trainData = Seq((0, 1.0, 2.0, 3.0), (1, 4.0, 5.0, 6.0), (0, 7.0, 8.0, 9.0))
      val validData = Seq((1, 2.0, 3.0, 4.0), (0, 5.0, 6.0, 7.0), (1, 8.0, 9.0, 10.0))
      
      val schema = Seq("label", "feature1", "feature2", "feature3")
      
      val train: DataFrame = spark.createDataFrame(trainData).toDF(schema: _*)
      val valid: DataFrame = spark.createDataFrame(validData).toDF(schema: _*)
      
      val featureColumns = Array("feature1", "feature2", "feature3")
      val assembler = new VectorAssembler()
        .setInputCols(featureColumns)
        .setOutputCol("features")
      
      val trainAssembled = assembler.transform(train)
      val validAssembled = assembler.transform(valid)
      
      val svm = new LinearSVC()
        .setMaxIter(10)
        .setRegParam(0.1)
      
      val model = svm.fit(trainAssembled)
      val predictions = model.transform(validAssembled)
      
      val recallByLabel = new MulticlassClassificationEvaluator()
        .setMetricName("recallByLabel")
      val weightedRecall = new MulticlassClassificationEvaluator()
        .setMetricName("weightedRecall")
      
      println(s"Recall by label: ${recallByLabel.evaluate(predictions)}")
      println(s"Weighted recall: ${weightedRecall.evaluate(predictions)}")

      produces the same recall by label and weighted recall, as described above.

      Attachments

        1. predictions_dot_show.png
          55 kB
          Alex Wozniakowski

          People

            Assignee: Unassigned
            Reporter: Alex Wozniakowski (airwoz)
            Votes: 0
            Watchers: 2
