Spark / SPARK-25544

Slow/failed convergence in Spark ML models due to internal predictor scaling


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.3.2
    • Fix Version/s: None
    • Component/s: ML
    • Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11

    Description

      The LinearRegression and LogisticRegression estimators in Spark ML can take a large number of iterations to converge, or fail to converge altogether, when trained using the l-bfgs method with standardization turned off.

      Details:

      LinearRegression and LogisticRegression standardize their input features by default. In SPARK-8522 the option to disable standardization was added. Internally, this is implemented by changing the effective strength of regularization rather than by disabling the feature scaling. Mathematically, changing the effective regularization strength and disabling feature scaling should give the same solution, but the two approaches can have very different convergence properties.

      The usual justification for scaling features is that it keeps all covariances O(1) and should improve numerical convergence, but this argument does not account for the regularization term. That is not a problem when standardization is set to true, since every feature then has an O(1) regularization strength. It does cause problems when standardization is set to false, because the effective regularization strength of feature i becomes O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. Predictors with small standard deviations (which can occur legitimately, e.g. via one-hot encoding) therefore get very large effective regularization strengths, and consequently very large gradients and poor convergence in the solver.
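      To make the scale of the problem concrete, here is a small illustrative sketch (plain Scala, with made-up standard deviations rather than values taken from any real model): it simply prints the 1/sigma_i^2 blow-up of the effective regularization strength for a low-variance feature.

      // Illustrative only: hypothetical per-feature standard deviations; the last one mimics a
      // near-constant feature such as the interaction of a rare one-hot category with a numeric column.
      val regParam = 1.0
      val featureStd = Seq("feature_a" -> 0.5, "feature_b" -> 1.0, "rare_interaction" -> 0.005)
      featureStd.foreach { case (name, sigma) =>
        // Effective L2 strength on feature i when standardization is off but internal scaling is kept.
        println(f"$name%-18s sigma = $sigma%7.3f  effective reg strength = ${regParam / (sigma * sigma)}%10.1f")
      }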

      Example code to recreate:

      To demonstrate just how bad these convergence issues can be, here is a very simple test case which builds a linear regression model with a categorical feature, a numerical feature and their interaction. When fed the specified training data, this model will fail to converge before it hits the maximum iteration limit. In this case, it is the interaction between category "2" and the numeric feature that leads to a feature with a small standard deviation.

      Training data:

      category  numericFeature  label
      1         1.0             0.5
      1         0.5             1.0
      2         0.01            2.0

       

      import org.apache.spark.ml.Pipeline
      import org.apache.spark.ml.feature.{Interaction, OneHotEncoder, StringIndexer, VectorAssembler}
      import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
      import spark.implicits._  // assumes a SparkSession named `spark`, e.g. in spark-shell

      val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0))
        .toDF("category", "numericFeature", "label")

      val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex")
      val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false)
      val interaction = new Interaction().setInputCols(Array("categoryEncoded", "numericFeature")).setOutputCol("interaction")
      val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", "interaction")).setOutputCol("features")
      val model = new LinearRegression()
        .setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction")
        .setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100)
      val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, assembler, model))

      val pipelineModel = pipeline.fit(df)

      // The LinearRegressionModel is the fifth stage (index 4) of the fitted pipeline.
      val numIterations = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations
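      For reference, one extra line can be appended to the snippet above to surface the problem; per the behaviour described here, numIterations comes back equal to the 100-iteration cap rather than a converged count.

      // With setStandardization(false) this run hits the maxIter cap instead of converging.
      println(s"Total iterations: $numIterations (maxIter = 100)")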

       Possible fix:

      These convergence issues can be fixed by turning off feature scaling when standardization is set to false rather than using an effective regularization strength. This can be hacked into LinearRegression.scala by simply replacing line 423

      val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
      

      with

      val featuresStd =
        if ($(standardization)) featuresSummarizer.variance.toArray.map(math.sqrt)
        else featuresSummarizer.variance.toArray.map(_ => 1.0)
      

      Rerunning the above test code with that hack in place leads to convergence after just 4 iterations instead of hitting the maximum iteration limit!

      Impact:

      I can't speak for other people, but I've personally encountered these convergence issues several times when building production-scale Spark ML models, and have resorted to writing my own implementation of LinearRegression with the above hack in place. The issue is made worse by the fact that Spark does not raise an error when the maximum number of iterations is hit, so the first time you encounter the issue it can take a while to figure out what is going on.
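      Since Spark does not surface this condition itself, one possible workaround (a sketch only; requireConverged is a hypothetical helper, and it assumes the fitted model still carries its training summary) is to compare totalIterations against maxIter after fitting:

      // Workaround sketch (not part of Spark): fail loudly when the solver stops because it
      // exhausted maxIter rather than because it converged.
      import org.apache.spark.ml.regression.LinearRegressionModel

      def requireConverged(model: LinearRegressionModel, maxIter: Int): Unit = {
        val iters = model.summary.totalIterations
        require(iters < maxIter,
          s"LinearRegression used $iters iterations and hit maxIter ($maxIter); the coefficients may not have converged.")
      }

      requireConverged(pipelineModel.stages(4).asInstanceOf[LinearRegressionModel], maxIter = 100)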

       

          People

            Assignee: Unassigned
            Reporter: Andrew Crosby
            Votes: 1
            Watchers: 3
