  1. Spark
  2. SPARK-27867

RegressionEvaluator: cache latest RegressionMetrics to avoid duplicated computation


    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels:
      None

      Description

      In most cases, we need several metrics for a given model.

      For example, for a regression model we may need R2, MAE and MSE.

      However, the current design of `Evaluator` does not support computing multiple metrics at once.

      In practice, we usually use RegressionEvaluator like this:

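      import org.apache.spark.ml.evaluation.RegressionEvaluator

      val evaluator = new RegressionEvaluator()

      // Each evaluate call below makes its own pass over df.
      val r2 = evaluator.setMetricName("r2").evaluate(df)
      val mae = evaluator.setMetricName("mae").evaluate(df)
      val mse = evaluator.setMetricName("mse").evaluate(df)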

       

      However, the current implementation of RegressionEvaluator needs one pass over the whole input dataset to compute a single metric, so the example above needs three passes.

      This can be optimized, since `RegressionMetrics` can compute all of these metrics at once.
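
      As a rough illustration of that point, the sketch below builds one `RegressionMetrics` instance from a DataFrame and reads several metrics from it. The helper name `allMetrics` and the default column names "prediction" and "label" are assumptions made for this example; it is not how RegressionEvaluator is implemented.

```scala
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper (not part of Spark): builds a single RegressionMetrics
// instance and reads several metrics from it.
def allMetrics(
    df: DataFrame,
    predictionCol: String = "prediction", // assumed column name
    labelCol: String = "label"            // assumed column name
): Map[String, Double] = {
  // RegressionMetrics expects an RDD of (prediction, observation) pairs.
  val predictionAndLabels = df
    .select(col(predictionCol).cast(DoubleType), col(labelCol).cast(DoubleType))
    .rdd
    .map(row => (row.getDouble(0), row.getDouble(1)))

  val metrics = new RegressionMetrics(predictionAndLabels)
  Map(
    "r2" -> metrics.r2,
    "mae" -> metrics.meanAbsoluteError,
    "mse" -> metrics.meanSquaredError,
    "rmse" -> metrics.rootMeanSquaredError
  )
}
```

      With the `df` from the example above, `allMetrics(df)("mae")` would stand in for `evaluator.setMetricName("mae").evaluate(df)`.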

      If we cache the latest inputs and the next evaluate call keeps the same inputs (everything except metricName), then we can obtain the metric directly from the cached intermediate summary.
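
      A minimal sketch of this caching idea follows. The class `CachingRegressionEvaluator` is hypothetical and not part of Spark. It assumes that "same inputs" can be approximated by reference equality on the DataFrame plus fixed column names, and it keeps a reference to the last dataset, which a real implementation might want to avoid.

```scala
import org.apache.spark.mllib.evaluation.RegressionMetrics
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical wrapper (not part of Spark) that remembers the metrics object
// built for the most recent dataset.
class CachingRegressionEvaluator(
    predictionCol: String = "prediction", // assumed column name
    labelCol: String = "label"            // assumed column name
) {
  // Last dataset seen (compared by reference) and the metrics built from it.
  private var cached: Option[(DataFrame, RegressionMetrics)] = None

  private def metricsFor(df: DataFrame): RegressionMetrics = cached match {
    case Some((lastDf, metrics)) if lastDf eq df =>
      metrics // same dataset object as last time: reuse the cached metrics
    case _ =>
      val predictionAndLabels = df
        .select(col(predictionCol).cast(DoubleType), col(labelCol).cast(DoubleType))
        .rdd
        .map(row => (row.getDouble(0), row.getDouble(1)))
      val metrics = new RegressionMetrics(predictionAndLabels)
      cached = Some((df, metrics))
      metrics
  }

  // Only metricName changes between calls, so repeated calls on the same
  // dataset reuse one RegressionMetrics instance.
  def evaluate(df: DataFrame, metricName: String): Double = {
    val metrics = metricsFor(df)
    metricName match {
      case "r2"   => metrics.r2
      case "mae"  => metrics.meanAbsoluteError
      case "mse"  => metrics.meanSquaredError
      case "rmse" => metrics.rootMeanSquaredError
      case other  => throw new IllegalArgumentException(s"Unsupported metric: $other")
    }
  }
}
```

      Reference equality is a deliberately conservative choice here: two DataFrames with identical contents but different object identities would trigger a recomputation rather than risk returning stale metrics.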

       

              People

              • Assignee: Unassigned
              • Reporter: podongfeng zhengruifeng