[SPARK-26166] CrossValidator.fit() bug: training and validation datasets may overlap - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.3.0
Fix Version/s: None
Component/s: ML
Labels: None

Description

In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with

df = dataset.select("*", rand(seed).alias(randCol))

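the following should be added (checkpoint() returns a new DataFrame, so the result has to be reassigned):

df = df.checkpoint()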

If df is not checkpointed, it is recomputed each time a training or validation DataFrame is created. The order of rows in df, on which rand(seed) depends, is not deterministic. As a result, the random column value for a given row can differ between computations, even with a fixed seed. Note that checkpoint() cannot be replaced with cache(): when a node fails, the cached data has to be recomputed, so the random numbers could change.
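To make the failure mode concrete, here is a minimal pure-Python sketch of the reasoning above. It is a simplified simulation, not Spark's implementation: the helper names (with_rand_column, train_fold, validation_fold), the 20-row dataset, and the use of reversed row order to stand in for a non-deterministic recomputation are all assumptions made for illustration.

```python
import random

def with_rand_column(rows, seed):
    """Attach a seeded pseudo-random value to each row, in arrival order.

    Hypothetical stand-in for adding a rand(seed) column: the value a row
    receives depends on its position in the sequence, not on its content.
    """
    rng = random.Random(seed)
    return [(row, rng.random()) for row in rows]

def validation_fold(rows_with_rand, lo, hi):
    """Rows whose random value falls inside [lo, hi)."""
    return {row for row, r in rows_with_rand if lo <= r < hi}

def train_fold(rows_with_rand, lo, hi):
    """Rows whose random value falls outside [lo, hi)."""
    return {row for row, r in rows_with_rand if not (lo <= r < hi)}

rows = list(range(20))
seed = 42

# Case 1: the random column is materialized once and reused for both
# filters (the role a checkpoint plays in the proposal above).
materialized = with_rand_column(rows, seed)
train_once = train_fold(materialized, 0.0, 0.5)
valid_once = validation_fold(materialized, 0.0, 0.5)
overlap_once = train_once & valid_once

# Case 2: the random column is recomputed for each filter, and the rows
# arrive in a different order the second time (simulated by reversing).
train_again = train_fold(with_rand_column(rows, seed), 0.0, 0.5)
valid_again = validation_fold(with_rand_column(list(reversed(rows)), seed), 0.0, 0.5)
overlap_recomputed = train_again & valid_again
missing_recomputed = set(rows) - (train_again | valid_again)

print("overlap when materialized once:", sorted(overlap_once))
print("overlap when recomputed:", sorted(overlap_recomputed))
print("rows in neither fold when recomputed:", sorted(missing_recomputed))
```

In the first case the two folds partition the rows by construction. In the second case the same seed is used, but a row can land in both folds or in neither, because the value it receives depends on its position in the sequence.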

This may be a particular problem when the input 'dataset' DataFrame is the result of a query that includes a 'where' clause. See the link below.

https://dzone.com/articles/non-deterministic-order-for-select-with-limit

People

Assignee: Unassigned

Reporter: Xinyong Tian

Votes: 0

Watchers: 2

Dates

Created: 25/Nov/18 20:06

Updated: 26/Oct/19 23:47

Resolved: 26/Oct/19 23:47