[SPARK-46794] Incorrect results due to inferred predicate from checkpoint with subquery

Details

    Description

      Spark can produce incorrect results when a checkpointed DataFrame was built with a filter containing a scalar subquery. The subquery is included in the constraints of the resulting LogicalRDD, and may then be propagated as an inferred filter when another plan is joined with the checkpointed DataFrame. As a result the subquery is evaluated twice: once during checkpointing and once while evaluating the query that uses the checkpoint. The two evaluations may return different results, e.g. when the subquery contains a LIMIT whose sort order does not fully determine which rows are returned.
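
The double evaluation described above can be illustrated with a small toy model in plain Python. This is not Spark code and does not reproduce the bug itself; it only mimics the mechanism. The table contents, the `scalar_subquery` helper, and the alternating tie-break are all hypothetical, and the sketch assumes the inferred predicate is applied to the other join input.

```python
from itertools import count

# Hypothetical table of (id, score) rows. Ids 1 and 2 tie on score.
ROWS = [(1, 10), (2, 10), (3, 20)]

_evaluations = count()


def scalar_subquery():
    """Toy stand-in for `SELECT id FROM t ORDER BY score LIMIT 1`.

    The sort key does not break the tie between ids 1 and 2, so either id
    is a valid answer. To imitate that nondeterminism reproducibly, the
    tie-break direction alternates on every evaluation.
    """
    n = next(_evaluations)
    sign = 1 if n % 2 == 0 else -1
    return min(ROWS, key=lambda row: (row[1], sign * row[0]))[0]


# Checkpointing: the subquery is evaluated once and the filtered rows are
# materialized. In the toy model this first evaluation picks id 1.
checkpoint_value = scalar_subquery()
checkpointed = [row for row in ROWS if row[0] == checkpoint_value]

# Later join: the filter is inferred from the checkpoint's constraints and
# the subquery is evaluated a second time. This evaluation picks id 2.
inferred_value = scalar_subquery()
other = list(ROWS)
other_filtered = [row for row in other if row[0] == inferred_value]

# Join on id, with and without the inferred filter.
joined = [(l, r) for l in checkpointed for r in other_filtered if l[0] == r[0]]
expected = [(l, r) for l in checkpointed for r in other if l[0] == r[0]]

print("with inferred filter:   ", joined)
print("without inferred filter:", expected)
```

In this model the join without the inferred filter returns the matching row, while the join with the re-evaluated filter returns nothing, because the two evaluations disagree on which id the subquery selected.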

            People

              tomvanbussel Tom van Bussel