  1. Spark
  2. SPARK-26787

Fix standardization error message in WeightedLeastSquares

    Details

    • Type: Documentation
    • Status: Resolved
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.3.0, 2.3.1, 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: MLlib
    • Labels:
      None
    • Environment:

      Tested in Spark 2.4.0 on Databricks Runtime 5.1 ML Beta.

       

      Description

      An error message in WeightedLeastSquares.scala is incorrect, which makes it unhelpful for diagnosing the underlying issue. The problem arises when fitting a regularized LinearRegression to a constant label. Even when the parameter standardization=False is set, the error falsely states that standardization was set to true:

      The standard deviation of the label is zero. Model cannot be regularized with standardization=true

      This happens because, under the hood, LinearRegression always sets the parameter standardizeLabel=True. That value was chosen for consistency with GLMNet. WeightedLeastSquares itself is written to work with standardizeLabel set either way, but the public LinearRegression API does not expose that choice.
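The mismatch described above can be sketched in plain Python. This is a simplified stand-in, not Spark's Scala code: the function name `check_regularizable` is hypothetical, the sketch ignores other settings such as fitIntercept, and the message wording is only illustrative (it is not the wording from the pull request). The point is that the check depends on standardizeLabel and regParam, not on the user-facing standardization flag, so the message should name the former.

```python
import statistics


def check_regularizable(labels, reg_param, standardize_label):
    """Hypothetical model of the label check; not Spark's actual implementation.

    Raises ValueError when the label is constant, regularization is enabled,
    and the label would be standardized. The feature-level `standardization`
    flag is deliberately absent: it plays no role in this condition.
    """
    label_std = statistics.pstdev(labels)  # population std; 0.0 for a constant label
    if label_std == 0.0 and reg_param > 0.0 and standardize_label:
        raise ValueError(
            "The standard deviation of the label is zero. "
            "Model cannot be regularized when the label is standardized "
            "(standardizeLabel=true)."  # illustrative wording only
        )
    return label_std


# LinearRegression always passes standardize_label=True, so a constant label
# with regParam > 0 fails regardless of the user's standardization setting.
try:
    check_regularizable([1, 1, 1], reg_param=1e-4, standardize_label=True)
except ValueError as err:
    message = str(err)
```

Under these assumptions, the same call with `standardize_label=False` returns normally, which mirrors the statement that WeightedLeastSquares works with the flag set either way.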

       

      I will submit a pull request with my suggested wording.

       

      Relevant:

      https://github.com/apache/spark/pull/10702

      https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5 

       

      The following Python code will replicate the error. 

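      import pandas as pd
      from pyspark.sql import SparkSession
      from pyspark.ml.feature import VectorAssembler
      from pyspark.ml.regression import LinearRegression
      
      # Databricks notebooks predefine `spark`; getOrCreate() reuses it or creates one elsewhere.
      spark = SparkSession.builder.getOrCreate()
      
      # Constant label: every row has label 1, so the label's standard deviation is zero.
      df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6], 'label': [1, 1, 1]})
      spark_df = spark.createDataFrame(df)
      
      vectorAssembler = VectorAssembler(inputCols=['foo', 'bar'], outputCol='features')
      train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label'])
      
      # Regularized fit without an intercept and with standardization explicitly disabled.
      lr = LinearRegression(featuresCol='features', labelCol='label', fitIntercept=False,
                            standardization=False, regParam=1e-4)
      
      # Raises the error quoted above, even though standardization=False.
      lr_model = lr.fit(train_sdf)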
      

       

      For context, someone might want to do this when fitting a model to estimate the components of a fixed total. The label indicates that the total is always 100%, while the components vary. One example is estimating the unknown weights of different quantities of substances in a series of full bins.
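To illustrate why a constant label with no intercept is still a meaningful regression, here is a minimal pure-Python sketch that solves the two-feature normal equations directly on the repro data. It is not Spark's solver, the function name `fit_two_feature_ridge` is hypothetical, and the `lam` penalty is a plain ridge term that does not reproduce how Spark scales regParam.

```python
def fit_two_feature_ridge(rows, labels, lam=0.0):
    """Solve (X^T X + lam * I) w = X^T y for two features and no intercept."""
    # Entries of the symmetric 2x2 matrix X^T X + lam * I.
    a = sum(x1 * x1 for x1, _ in rows) + lam
    b = sum(x1 * x2 for x1, x2 in rows)
    d = sum(x2 * x2 for _, x2 in rows) + lam
    # Entries of the right-hand side X^T y.
    r1 = sum(x1 * y for (x1, _), y in zip(rows, labels))
    r2 = sum(x2 * y for (_, x2), y in zip(rows, labels))
    det = a * d - b * b
    if det == 0:
        raise ValueError("Normal equations are singular for this data.")
    # Closed-form inverse of a 2x2 matrix applied to the right-hand side.
    return (d * r1 - b * r2) / det, (a * r2 - b * r1) / det


# Same data as the repro: (foo, bar) per row, with a constant label of 1.
rows = [(1, 4), (2, 5), (3, 6)]
labels = [1, 1, 1]

w_foo, w_bar = fit_two_feature_ridge(rows, labels)
predictions = [w_foo * x1 + w_bar * x2 for x1, x2 in rows]
```

With `lam=0.0` the fitted weights reproduce the constant label on every row (up to floating-point rounding), so the zero label variance is not a defect in the data; it only becomes a problem once the label is standardized.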

              People

              • Assignee:
                bscan Brian Scannell
              • Reporter:
                bscan Brian Scannell