[SPARK-26787] Fix standardization error message in WeightedLeastSquares - ASF JIRA

Details

Type: Documentation
Status: Resolved
Priority: Trivial
Resolution: Fixed
Affects Version/s: 2.3.0, 2.3.1, 2.4.0
Fix Version/s: 3.0.0
Component/s: MLlib
Labels:
None
Environment:

Tested with Spark 2.4.0 on Databricks (runtime 5.1 ML Beta).

Description

An error message in WeightedLeastSquares.scala is incorrect and therefore unhelpful for diagnosing the underlying issue. The problem arises when fitting a regularized LinearRegression to a constant label. Even when the parameter standardization=False, the error falsely states that standardization was set to true:

The standard deviation of the label is zero. Model cannot be regularized with standardization=true

This happens because, under the hood, LinearRegression always sets the parameter standardizeLabel=True. That default was chosen for consistency with GLMNet. WeightedLeastSquares itself is written to work with standardizeLabel set either way, but the public LinearRegression API does not expose that choice.
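The interaction described above can be sketched in plain Python. This is a simplified illustration, not the Scala code in WeightedLeastSquares.scala: the function name check_label_standardization, its parameters, and the reworded message are hypothetical, the use of the population standard deviation is an assumption, and any other settings the real check may depend on (such as fitIntercept) are ignored.

```python
import statistics


def check_label_standardization(labels, reg_param, standardize_label=True):
    """Hypothetical sketch of the guard that triggers the error.

    standardize_label mirrors the internal standardizeLabel flag, which the
    issue says LinearRegression always sets to True. The feature-level
    `standardization` parameter does not appear here at all, which is why
    an error message that mentions it is misleading.
    """
    # Assumption: population standard deviation; the real code may differ.
    label_std = statistics.pstdev(labels)
    if label_std == 0.0 and reg_param > 0.0 and standardize_label:
        # Illustrative wording only, not necessarily the merged message.
        raise ValueError(
            "The standard deviation of the label is zero. "
            "Model cannot be regularized when the label is standardized."
        )
    return label_std
```

Under these assumptions, a constant label such as [1, 1, 1] with reg_param=1e-4 raises regardless of how feature standardization is configured, while the same call with reg_param=0.0 or standardize_label=False passes.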

I will submit a pull request with my suggested wording.

Relevant:

https://github.com/apache/spark/pull/10702

https://github.com/apache/spark/pull/10274/commits/d591989f7383b713110750f80b2720bcf24814b5

The following Python code will replicate the error.

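import pandas as pd
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

# On Databricks `spark` already exists; getOrCreate() reuses that session.
spark = SparkSession.builder.getOrCreate()

# Two features and a constant label (standard deviation of zero).
df = pd.DataFrame({'foo': [1, 2, 3], 'bar': [4, 5, 6], 'label': [1, 1, 1]})
spark_df = spark.createDataFrame(df)

# Combine the feature columns into the single vector column LinearRegression expects.
vectorAssembler = VectorAssembler(inputCols=['foo', 'bar'], outputCol='features')
train_sdf = vectorAssembler.transform(spark_df).select(['features', 'label'])

# Regularized fit with standardization explicitly disabled.
lr = LinearRegression(featuresCol='features', labelCol='label', fitIntercept=False, standardization=False, regParam=1e-4)

# Fails with the misleading "standardization=true" message.
lr_model = lr.fit(train_sdf)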

For context, someone might want to do this when fitting a model to estimate the components of a fixed total. The label indicates that the total is always 100%, while the components vary. For example, one might estimate the unknown weights of different substances from a series of full bins that each hold different quantities of them.
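To illustrate that a regularized, intercept-free fit to a constant label is well-posed when the label is not standardized, here is a minimal sketch in plain Python using the data from the reproduction above. It solves ordinary ridge normal equations for two features. The function ridge_no_intercept_2d is hypothetical, and the objective is an assumption: Spark scales the loss and penalty differently, so the numbers are not expected to match what LinearRegression would return.

```python
def ridge_no_intercept_2d(rows, labels, reg_param):
    """Solve (X^T X + reg_param * I) w = X^T y for two features, no intercept."""
    # Entries of the symmetric 2x2 matrix X^T X + reg_param * I.
    a11 = sum(x1 * x1 for x1, _ in rows) + reg_param
    a12 = sum(x1 * x2 for x1, x2 in rows)
    a22 = sum(x2 * x2 for _, x2 in rows) + reg_param
    # Entries of the vector X^T y.
    b1 = sum(x1 * y for (x1, _), y in zip(rows, labels))
    b2 = sum(x2 * y for (_, x2), y in zip(rows, labels))
    # Cramer's rule for the 2x2 system.
    det = a11 * a22 - a12 * a12
    w1 = (b1 * a22 - a12 * b2) / det
    w2 = (a11 * b2 - a12 * b1) / det
    return w1, w2


# Same values as the 'foo', 'bar', and 'label' columns in the reproduction.
rows = [(1, 4), (2, 5), (3, 6)]
labels = [1, 1, 1]

w_foo, w_bar = ridge_no_intercept_2d(rows, labels, reg_param=1e-4)
predictions = [w_foo * x1 + w_bar * x2 for x1, x2 in rows]
```

With this small penalty the weights come out close to -1/3 and 1/3, and every prediction is close to the constant label, so the constant label by itself does not make the problem ill-defined.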

Issue Links

links to

GitHub Pull Request #23705

People

Assignee: Brian Scannell

Reporter: Brian Scannell

Votes: 0

Watchers: 1

Dates

Created: 30/Jan/19 19:16

Updated: 01/Feb/19 01:53

Resolved: 01/Feb/19 01:50