Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
1.0.2
None
Description
ZippedRDD loses some elements when zipping RDDs that have the same number of partitions but different numbers of elements in corresponding partitions.
This can happen when a user creates an RDD with sc.textFile(path, partitionNumbers) from a physically unbalanced HDFS file.
var x = sc.parallelize(1 to 9, 3)
var y = sc.parallelize(Array(1,1,1,1,1,2,2,3,3), 3).keyBy(i => i)
var z = y.partitionBy(new RangePartitioner(3, y))

expected
x.zip(y).count()
9
x.zip(y).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(1,1)), (5,(1,1)), (6,(2,2)), (7,(2,2)), (8,(3,3)), (9,(3,3)))

unexpected
x.zip(z).count()
7
x.zip(z).collect()
Array[(Int, (Int, Int))] = Array((1,(1,1)), (2,(1,1)), (3,(1,1)), (4,(2,2)), (5,(2,2)), (7,(3,3)), (8,(3,3)))
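
The unexpected output is consistent with a partition-by-partition zip that silently stops at the shorter side. The sketch below reproduces the same seven pairs in plain Scala, without Spark. It is only an illustration under stated assumptions: the partition contents are inferred from the output above (x split evenly as 3/3/3, z split by key as 5/2/2), and the names ZipTruncationSketch, xParts, zParts and zipPartitions are hypothetical, not Spark internals.

```scala
object ZipTruncationSketch {
  // Assumed layout of x: 1 to 9 split evenly across 3 partitions.
  val xParts: Seq[Seq[Int]] = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))

  // Assumed layout of z after range partitioning: one key per partition (5/2/2).
  val zParts: Seq[Seq[(Int, Int)]] = Seq(
    Seq.fill(5)((1, 1)),
    Seq.fill(2)((2, 2)),
    Seq.fill(2)((3, 3))
  )

  // Pair partition i of the left side with partition i of the right side.
  // Iterator.zip ends as soon as either iterator is exhausted, so any
  // surplus elements in the longer partition are dropped without an error.
  def zipPartitions[A, B](left: Seq[Seq[A]], right: Seq[Seq[B]]): Seq[(A, B)] =
    left.zip(right).flatMap { case (l, r) => l.iterator.zip(r.iterator).toSeq }

  def main(args: Array[String]): Unit = {
    val zipped = zipPartitions(xParts, zParts)
    println(zipped.size) // prints 7
    println(zipped.mkString(", "))
  }
}
```

Under these assumptions, partition 0 yields 3 pairs (two elements of z are dropped), while partitions 1 and 2 yield 2 pairs each (the elements 6 and 9 of x are dropped), which matches the missing 6 and 9 in the collected result.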