Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 1.1.1
- None
- None
- Environment: Spark 1.1.1 running on YARN 2.4 via Amazon EMR.
Description
I hate to file bugs that are hard to reproduce (by other people), but after spending a full week trying to debug my code, I constructed a scenario where the following assertion FAILS.
val x: RDD[T] = ...  // construction elided in the report
val y = x.distinct()
assert(y.count() <= x.count())
I am at a complete loss as to how this can occur under ANY definition of equality/order unless the RDD underlying x is mutable. Since none of my RDD transforms mutate any existing RDD data and I am reading from immutable sources (data on S3), I conclude that there must be a bug in Spark or I am mutating my data unknowingly.
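One explanation that fits the linked duplicate: RDDs are evaluated lazily and, unless persisted, are recomputed for every action. `x.count()` and `x.distinct().count()` therefore each re-read the source, and if two reads return different numbers of items, the assertion can fail without anything being mutated. The sketch below is a diagnostic under that assumption, not a confirmed fix. It persists `x` so both counts are taken from the same materialized data. `DistinctCountCheck` and `checkDistinct` are hypothetical names, and the local master and sample data stand in for the real S3 input.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object DistinctCountCheck {

  // Hypothetical helper: returns (total count, distinct count), both
  // computed from a persisted copy of `x` rather than from two
  // independent reads of the underlying source.
  def checkDistinct[T](x: RDD[T]): (Long, Long) = {
    val snapshot = x.persist(StorageLevel.MEMORY_AND_DISK)
    val total = snapshot.count()              // first action materializes the data
    val unique = snapshot.distinct().count()  // reuses the persisted partitions
    (total, unique)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with in-memory sample data, for illustration only.
    val conf = new SparkConf().setAppName("distinct-count-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val x = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
      val (total, unique) = checkDistinct(x)
      assert(unique <= total, s"distinct count $unique exceeds total $total")
      println(s"total=$total distinct=$unique")
    } finally {
      sc.stop()
    }
  }
}
```

If the assertion holds on the persisted RDD but fails without `persist`, that points at inconsistent reads from the source rather than at `distinct()`. Persisting is not a hard guarantee, since partitions lost with an executor are recomputed from the source.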
Issue Links
- duplicates SPARK-2579 Reading from S3 returns an inconsistent number of items with Spark 0.9.1 (Resolved)