  1. Spark
  2. SPARK-24431

wrong areaUnderPR calculation in BinaryClassificationEvaluator

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: ML

    Description

      I am using CrossValidator(estimator=LogisticRegression(...), ..., evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to select the best model. When regParam in the logistic regression is very high, no variable is selected (effectively no model), i.e. every row gets the same prediction, e.g. equal to the event rate (baseline frequency). At that point, however, BinaryClassificationEvaluator reports the highest areaUnderPR. As a result, the best model selected is this no-variable model.

      The reason is as follows. With no model, the precision-recall curve has only two points. At recall = 0, precision should be set to zero, but the software sets it to 1. At recall = 1, precision is the event rate. As a result, areaUnderPR is close to 0.5 (the area under the line joining the two points is (1 + event rate) / 2, and my event rate is very low), which is the highest value among the candidate models.
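
To make the arithmetic concrete, here is a minimal sketch in plain Python (not Spark code). It assumes the area is integrated with the trapezoidal rule over the two points described above. The helper name trapezoid_area and the event rate of 0.01 are hypothetical, chosen only for illustration.

```python
def trapezoid_area(points):
    """Trapezoidal area under a curve given as (recall, precision) pairs sorted by recall."""
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Hypothetical event rate (baseline frequency); the report only says it is "very low".
event_rate = 0.01

# The two-point curve described in the report for a constant-prediction model:
# precision is set to 1 at recall = 0, and equals the event rate at recall = 1.
constant_model_curve = [(0.0, 1.0), (1.0, event_rate)]

area_reported = trapezoid_area(constant_model_curve)
print(f"areaUnderPR for the constant model: {area_reported:.3f}")
```

Under these assumptions the area comes out at (1 + event_rate) / 2, so a model with no predictive power scores just above 0.5 regardless of how rare the event is.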

      The proposed solution is to set precision = 0 when recall = 0.
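
The effect of this proposal can be sketched in plain Python (again, not Spark code) by comparing the two choices of precision at recall = 0 for a constant-prediction model. The function two_point_pr_area, its parameter names, and the sample event rates are hypothetical, and the trapezoidal rule is assumed.

```python
def two_point_pr_area(event_rate, precision_at_zero_recall):
    """Trapezoidal area under the two-point curve
    (0, precision_at_zero_recall) -> (1, event_rate)."""
    return (precision_at_zero_recall + event_rate) / 2.0


# Hypothetical event rates, for illustration only.
results = {}
for event_rate in (0.001, 0.01, 0.1):
    current = two_point_pr_area(event_rate, precision_at_zero_recall=1.0)
    proposed = two_point_pr_area(event_rate, precision_at_zero_recall=0.0)
    results[event_rate] = (current, proposed)
    print(f"event rate {event_rate}: current {current:.4f}, proposed {proposed:.4f}")
```

With precision anchored at 0, the constant model's area drops to half the event rate, so it would no longer outrank models that actually separate the classes.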

            People

              Assignee: Unassigned
              Reporter: Xinyong Tian (Ben2018)
              Votes: 0
              Watchers: 3
