Spark / SPARK-26166

CrossValidator.fit() bug: training and validation datasets may overlap


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: ML
    • Labels: None

    Description

      In pyspark.ml.tuning.CrossValidator.fit(), after the random column is added with

      df = dataset.select("*", rand(seed).alias(randCol))

      the following line should be added:

      df = df.checkpoint()  # checkpoint() returns a new DataFrame, so the result must be reassigned

      If df is not checkpointed, it is recomputed each time a training or validation DataFrame is created. The order of rows in df, on which rand(seed) depends, is not deterministic. As a result, the random column value for a given row can differ between recomputations, even with a fixed seed. Note that checkpoint() cannot be replaced with cache(): when a node fails, the cached data has to be recomputed, so the random numbers could change.

      This may be a particular problem when the input DataFrame 'dataset' is the result of a query that includes a 'where' clause. See below.

      https://dzone.com/articles/non-deterministic-order-for-select-with-limit
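
The mechanism described above can be illustrated without a cluster. The following is a minimal pure-Python simulation, not Spark code: rand_column and split are hypothetical helpers, and the assumption that a seeded generator hands out values by row position is a simplified stand-in for how rand(seed) behaves within a partition. The fold boundaries [0.0, 0.5) mimic the first fold of a 2-fold split.

```python
import random


def rand_column(rows, seed):
    """Attach a seeded random value to each row.

    Assumption: values are drawn sequentially, so they depend on the
    position of a row in the input, not on the row's content.
    """
    rng = random.Random(seed)
    return [(row, rng.random()) for row in rows]


def split(rows_with_rand, lower, upper):
    """Return (train, validation) row sets for the fold [lower, upper)."""
    validation = {r for r, v in rows_with_rand if lower <= v < upper}
    train = {r for r, v in rows_with_rand if not (lower <= v < upper)}
    return train, validation


rows = list(range(1000))
seed = 42

# Case 1: the random column is recomputed for each filter, and the rows
# happen to arrive in a different order the second time.
first_eval = rand_column(rows, seed)
second_eval = rand_column(list(reversed(rows)), seed)
validation_recomputed = split(first_eval, 0.0, 0.5)[1]
train_recomputed = split(second_eval, 0.0, 0.5)[0]
overlap_recomputed = train_recomputed & validation_recomputed

# Case 2: the random column is materialized once (the role checkpoint()
# would play in this report) and reused for both filters.
materialized = rand_column(rows, seed)
train_fixed, validation_fixed = split(materialized, 0.0, 0.5)
overlap_fixed = train_fixed & validation_fixed

print("overlap when recomputed:  ", len(overlap_recomputed))
print("overlap when materialized:", len(overlap_fixed))
```

In this sketch the overlap is non-empty only when the column is recomputed over a reordered input. Whether Spark actually reorders rows between recomputations depends on the query plan that produced 'dataset'.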

       

       

       


          People

            Assignee: Unassigned
            Reporter: Xinyong Tian (Ben2018)