Details
- Type: Improvement
- Status: Resolved
- Priority: Major
- Resolution: Won't Fix
- Affects Version/s: 3.0.0
- Fix Version/s: None
Description
A classification/regression task almost always involves more than one feature, yet the current API for specifying feature columns in the Predictor of Spark MLlib supports only a single column. Users therefore have to assemble the multiple feature columns into an "org.apache.spark.ml.linalg.Vector" before fitting a Spark ML pipeline.
This improvement lets users specify the feature columns directly, without vectorization. To support this, we can introduce two new APIs in both "Predictor" and "PredictionModel", plus a new parameter named "featuresCols" that stores the feature column names as an array. (A PR is ready here: https://github.com/apache/spark/pull/25983)
APIs:
def setFeaturesCol(value: Array[String]): M = ...
protected def isSupportMultiColumnsForFeatures: Boolean = false
Parameter:
final val featuresCols: StringArrayParam = new StringArrayParam(this, "featuresCols", ...)
ML implementations can then read the feature column names from the new "featuresCols" parameter and consume the raw feature data directly from the corresponding separate columns of the dataset.
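The proposed pieces can be sketched as a plain-Scala stand-in, without a Spark dependency; "StringArrayParam" and "MyPredictor" below are simplified placeholders for illustration only, not the real Spark ML classes, and the setter returns this.type rather than the model type M used in the proposal:

```scala
// Simplified stand-in for Spark ML's StringArrayParam: holds an array-of-strings
// parameter value together with its owner, name, and documentation string.
class StringArrayParam(val parent: AnyRef, val name: String, val doc: String) {
  private var value: Option[Array[String]] = None
  def set(v: Array[String]): Unit = { value = Some(v) }
  def get: Array[String] = value.getOrElse(Array.empty)
}

// Simplified stand-in for a Predictor that accepts multiple feature columns.
class MyPredictor {
  // The proposed parameter: feature column names stored as an array.
  final val featuresCols: StringArrayParam =
    new StringArrayParam(this, "featuresCols", "names of the feature columns")

  // Proposed setter: accept several raw column names instead of one
  // pre-assembled vector column.
  def setFeaturesCol(value: Array[String]): this.type = {
    featuresCols.set(value)
    this
  }

  // Implementations that can consume raw columns directly override this to true.
  protected def isSupportMultiColumnsForFeatures: Boolean = false
}
```

A fit implementation would then call featuresCols.get to locate the raw feature columns in the input dataset, instead of looking up a single assembled vector column.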