Spark / SPARK-27867

RegressionEvaluator should cache the latest RegressionMetrics to avoid duplicated computation


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      In most cases, we need to compute multiple metrics for a given model.

      For example, for a regression model we may need R2, MAE and MSE.

      However, the current design of `Evaluator` does not support computing multiple metrics at once.

      In practice, we usually use RegressionEvaluator like this:

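      import org.apache.spark.ml.evaluation.RegressionEvaluator

      val evaluator = new RegressionEvaluator()

      // df is the DataFrame of predictions and labels to evaluate;
      // each evaluate call below makes its own pass over df
      val r2 = evaluator.setMetricName("r2").evaluate(df)
      val mae = evaluator.setMetricName("mae").evaluate(df)
      val mse = evaluator.setMetricName("mse").evaluate(df)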

       

      However, the current implementation of RegressionEvaluator needs one pass over the whole input dataset to compute each metric, so the example above needs three passes.

      This can be optimized, since `RegressionMetrics` can compute all of these metrics at once.
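To illustrate the point about `RegressionMetrics`, here is a minimal sketch that builds one `RegressionMetrics` instance and reads R2, MAE and MSE from it. The object `MetricsSketch` and the helper `regressionSummary` are hypothetical names, not part of Spark. The sketch assumes the input has non-null `prediction` and `label` columns, which are the `RegressionEvaluator` defaults.

```scala
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper object, not part of Spark.
object MetricsSketch {
  // Returns (r2, mae, mse) for df.
  // Assumes non-null "prediction" and "label" columns.
  def regressionSummary(df: DataFrame): (Double, Double, Double) = {
    val predictionAndLabels = df
      .select(col("prediction").cast(DoubleType), col("label").cast(DoubleType))
      .rdd
      .map { case Row(prediction: Double, label: Double) => (prediction, label) }

    // One RegressionMetrics instance serves all three metrics.
    val metrics = new RegressionMetrics(predictionAndLabels)
    (metrics.r2, metrics.meanAbsoluteError, metrics.meanSquaredError)
  }
}
```

With this helper, `val (r2, mae, mse) = MetricsSketch.regressionSummary(df)` would replace the three separate `evaluate` calls, at the cost of bypassing the `Evaluator` API.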

      If we cache the latest inputs, and the next evaluate call uses the same inputs (everything except metricName), then we can obtain the metric directly from the cached intermediate summary.
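The caching idea can be sketched without Spark. In the hedged example below, a plain `Seq` of (prediction, label) pairs stands in for the input dataset, and `Summary` and `CachingEvaluator` are hypothetical names rather than Spark classes. The evaluator recomputes the summary only when it is handed a different input object, and the `computations` counter exists only to make the reuse visible.

```scala
// Hypothetical sketch: none of these names exist in Spark.
final case class Summary(r2: Double, mae: Double, mse: Double)

object Summary {
  // Computes all three metrics together from (prediction, label) pairs.
  def compute(pairs: Seq[(Double, Double)]): Summary = {
    val n = pairs.size.toDouble
    val meanLabel = pairs.map(_._2).sum / n
    val ssErr = pairs.map { case (p, y) => (y - p) * (y - p) }.sum
    val ssTot = pairs.map { case (_, y) => (y - meanLabel) * (y - meanLabel) }.sum
    val mae = pairs.map { case (p, y) => math.abs(y - p) }.sum / n
    Summary(r2 = 1 - ssErr / ssTot, mae = mae, mse = ssErr / n)
  }
}

class CachingEvaluator {
  // The last input seen and the summary computed from it.
  private var cached: Option[(Seq[(Double, Double)], Summary)] = None

  // How many times the summary was actually computed.
  var computations = 0

  def evaluate(pairs: Seq[(Double, Double)], metricName: String): Double = {
    val summary = cached match {
      case Some((prev, s)) if prev eq pairs => s // same input object: reuse
      case _ =>
        computations += 1
        val s = Summary.compute(pairs)
        cached = Some((pairs, s))
        s
    }
    metricName match {
      case "r2"  => summary.r2
      case "mae" => summary.mae
      case "mse" => summary.mse
      case other => throw new IllegalArgumentException(s"Unknown metric: $other")
    }
  }
}
```

Calling `evaluate` three times on the same `Seq` with "r2", "mae" and "mse" leaves `computations` at 1. The sketch compares inputs by reference (`eq`), which is only one possible notion of "the same inputs"; a real implementation would also have to compare the other parameters, such as the column names, and would keep a reference to the cached dataset alive.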

       

            People

              Assignee: Unassigned
              Reporter: Ruifeng Zheng (podongfeng)