Spark / SPARK-5917

Distinct is broken

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.1.1
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None
    • Environment: Spark 1.1.1 running on YARN 2.4 via Amazon EMR.

    Description

      I hate to file bugs that are hard for other people to reproduce, but after spending a full week debugging my code, I have constructed a scenario in which the following assertion FAILS:

      val x : RDD[T] = ....
      val y = x.distinct()
      assert( y.count() <= x.count() )

      I am at a complete loss as to how this can occur under ANY definition of equality/order unless the RDD underlying x is mutable. Since none of my RDD transforms mutate any existing RDD data and I am reading from immutable sources (data on S3), I conclude that there must be a bug in Spark or I am mutating my data unknowingly.
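
      One way to test the "mutable x" hypothesis is sketched below. It relies on the fact that an unpersisted RDD is recomputed from its lineage by every action, so x.count() and x.distinct().count() can observe different data if any step in the lineage is nondeterministic. This is only an illustration under stated assumptions, not a diagnosis of the reporter's job: the names DistinctCheck, unstableRdd and pinnedCounts are hypothetical, and the random filter is a stand-in for whatever nondeterminism a real pipeline might contain.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel
import scala.util.Random

object DistinctCheck {
  // Hypothetical stand-in for x: the filter is nondeterministic, so each
  // recomputation of the lineage may keep a different subset of the input.
  def unstableRdd(sc: SparkContext): RDD[Int] =
    sc.parallelize(1 to 100000, 4).filter(_ => Random.nextBoolean())

  // Pin x with persist() so that both actions read the same materialized data.
  def pinnedCounts(sc: SparkContext): (Long, Long) = {
    val x = unstableRdd(sc).persist(StorageLevel.MEMORY_AND_DISK)
    try {
      val total = x.count()              // first action computes and caches x
      val unique = x.distinct().count()  // reads the cached partitions
      (total, unique)
    } finally {
      x.unpersist()
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("distinct-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Without persist: two actions, two independent recomputations of x,
      // so the distinct count can come out larger than the total.
      val x = unstableRdd(sc)
      println(s"unpinned: count=${x.count()} distinct=${x.distinct().count()}")

      // With persist: the comparison is made against a single snapshot.
      val (total, unique) = pinnedCounts(sc)
      println(s"pinned:   count=$total distinct=$unique")
      assert(unique <= total)
    } finally {
      sc.stop()
    }
  }
}
```

      If the assertion holds once x is persisted but fails without it, the lineage of x is probably not deterministic. Note that persist() is best effort: an evicted or lost partition is recomputed, so checkpointing x or writing it out and reading it back is a stricter way to pin the data.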

            People

              Assignee: Unassigned
              Reporter: Derrick Burns (derrickburns)
              Votes: 0
              Watchers: 2