  1. Spark
  2. SPARK-26172

Unify String Params' case-insensitivity in ML

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      Currently, ML handles the case of string param values in three different ways:

      1. case-insensitive, with the getter returning the value as passed to the setter, e.g. LogisticRegression;

      2. case-insensitive, but with the getter returning the lower-cased value (not the value passed to the setter), e.g. ALS, DecisionTreeClassifier;

      3. case-sensitive, e.g. NaiveBayes.

       

      This inconsistency is confusing for users.
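
      The three behaviors can be illustrated with a small self-contained sketch. None of this is Spark code: the classes KeepsOriginal, StoresLowerCase and CaseSensitive, and the "family" values, are hypothetical stand-ins chosen only to show how the same setter call plays out under each approach.

```scala
import java.util.Locale

// Hypothetical stand-ins for the three behaviors; not actual Spark classes.
object CaseHandlingStyles {
  private val supported = Set("auto", "binomial", "multinomial")

  // Way 1: validate case-insensitively, keep the caller's value untouched.
  final class KeepsOriginal {
    private var family = "auto"
    def setFamily(value: String): this.type = {
      require(supported.contains(value.toLowerCase(Locale.ROOT)), s"Unsupported family: $value")
      family = value
      this
    }
    def getFamily: String = family
  }

  // Way 2: validate case-insensitively, but store the lower-cased value,
  // so the getter no longer returns what was passed to the setter.
  final class StoresLowerCase {
    private var family = "auto"
    def setFamily(value: String): this.type = {
      val lower = value.toLowerCase(Locale.ROOT)
      require(supported.contains(lower), s"Unsupported family: $value")
      family = lower
      this
    }
    def getFamily: String = family
  }

  // Way 3: case-sensitive validation; "Binomial" is rejected.
  final class CaseSensitive {
    private var family = "auto"
    def setFamily(value: String): this.type = {
      require(supported.contains(value), s"Unsupported family: $value")
      family = value
      this
    }
    def getFamily: String = family
  }
}
```

      With this sketch, setFamily("Binomial") is accepted by the first two classes (the getters return "Binomial" and "binomial" respectively) and rejected by the third with an IllegalArgumentException.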

      I think we should adopt the first way and support case-insensitivity for all string params that are not column names, including:

      • LogisticRegression: family
      • MultilayerPerceptronClassifier: solver
      • NaiveBayes: modelType
      • DecisionTreeClassifier: impurity
      • RandomForestClassifier: featureSubsetStrategy, impurity
      • GBTClassifier: featureSubsetStrategy, impurity, lossType
      • LinearRegression: solver, loss
      • GeneralizedLinearRegression: family, link, solver
      • DecisionTreeRegressor: impurity
      • RandomForestRegressor: featureSubsetStrategy, impurity
      • GBTRegressor: featureSubsetStrategy, impurity, lossType
      • KMeans: initMode
      • LDA: optimizer
      • PowerIterationClustering: initMode
      • ALS: coldStartStrategy, intermediateStorageLevel, finalStorageLevel
      • Bucketizer: handleInvalid
      • ChiSqSelector: selectorType
      • Imputer: strategy
      • QuantileDiscretizer: handleInvalid
      • RFormula: handleInvalid, stringIndexerOrderType
      • StringIndexer: handleInvalid, stringOrderType
      • VectorAssembler: handleInvalid
      • VectorIndexer: handleInvalid
      • VectorSizeHint: handleInvalid
      • OneHotEncoderEstimator: handleInvalid (this will be left alone until the breaking change)
      • BinaryClassificationEvaluator: metricName
      • MulticlassClassificationEvaluator: metricName
      • RegressionEvaluator: metricName
      • ClusteringEvaluator: metricName, distanceMeasure

       

       

       

      To do this:

      • methods lowerCaseInArray and upperCaseInArray are added to ParamValidators to validate values case-insensitively;
      • methods $$(param: Param[String]) and %%(param: Param[String]) are added to trait Params to conveniently get the lower-cased/upper-cased param value. This minimizes the changes to existing code, since in many cases we only need to change $(param) to $$(param);
      • in SharedParamsCodeGen, handleInvalid and distanceMeasure are updated to use lowerCaseInArray.
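
      A rough, self-contained sketch of how these pieces could fit together is given below. It is a toy model, not the actual patch: StringParam, ToyParams and ToyEstimator are hypothetical stand-ins for Param[String], trait Params and an estimator, and the semantics assumed for lowerCaseInArray / upperCaseInArray (normalize the candidate, then check membership in an already lower-/upper-cased array) are a guess based on the names and on the shape of the existing ParamValidators.inArray.

```scala
import java.util.Locale

// Toy model of the proposal; names mirror the description but this is not Spark code.
object CaseInsensitiveParamsSketch {

  // Assumed semantics: `allowed` is written in lower case and the candidate
  // is lower-cased before the membership check.
  def lowerCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toLowerCase(Locale.ROOT))

  // Same idea for params whose canonical values are upper case.
  def upperCaseInArray(allowed: Array[String]): String => Boolean =
    (value: String) => allowed.contains(value.toUpperCase(Locale.ROOT))

  // Hypothetical stand-in for Param[String] with a validator.
  final case class StringParam(name: String, isValid: String => Boolean)

  // Hypothetical stand-in for trait Params.
  trait ToyParams {
    private val values = scala.collection.mutable.Map.empty[String, String]

    protected def set(param: StringParam, value: String): this.type = {
      require(param.isValid(value), s"${param.name} given invalid value $value")
      values(param.name) = value // stored exactly as passed in
      this
    }

    // Value exactly as set by the user.
    protected def $(param: StringParam): String = values(param.name)
    // Lower-cased / upper-cased views for internal comparisons.
    protected def $$(param: StringParam): String = $(param).toLowerCase(Locale.ROOT)
    protected def %%(param: StringParam): String = $(param).toUpperCase(Locale.ROOT)
  }

  final class ToyEstimator extends ToyParams {
    val solver = StringParam("solver", lowerCaseInArray(Array("auto", "normal", "l-bfgs")))
    val storageLevel =
      StringParam("storageLevel", upperCaseInArray(Array("MEMORY_ONLY", "MEMORY_AND_DISK")))

    def setSolver(value: String): this.type = set(solver, value)
    def getSolver: String = $(solver) // getter returns the original spelling

    def setStorageLevel(value: String): this.type = set(storageLevel, value)

    // Internal logic switches from $(solver) to $$(solver) and nothing else changes.
    def usesNormalEquations: Boolean = $$(solver) == "normal"
    def normalizedStorageLevel: String = %%(storageLevel)
  }
}
```

      In this sketch, setSolver("Normal") passes validation, getSolver still returns "Normal", and the internal check against "normal" succeeds because it reads the value through $$ rather than $.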

       

          People

            Assignee: Unassigned
            Reporter: Ruifeng Zheng (podongfeng)