Description
generateLDAData uses the same RNG in all partitions to generate random data. Each task receives its own copy of that RNG in the same initial state, which causes either duplicate rows in cluster mode or nondeterministic behavior in local mode:
scala> val rng = new java.util.Random(10)
rng: java.util.Random = java.util.Random@78c5ef58

scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) }.collect().mkString("\n")
res12: String =
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
We should create one RNG per partition to make it safe.
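
One possible shape for the per-partition RNG is sketched below. It is not the actual patch: the helper name fillPartition and the seed + partition index seeding scheme are assumptions made for illustration. The RNG is constructed inside the partition rather than captured from the driver, so each partition draws from its own stream while the result stays reproducible for a given base seed.

```scala
import java.util.Random

// Hypothetical helper (not part of generateLDAData): builds the rows for one
// partition with an RNG created inside that partition. Seeding with
// seed + partitionIndex is an assumed scheme, chosen only for illustration.
def fillPartition(seed: Long, partitionIndex: Int, rows: Iterator[Int]): Iterator[Seq[Int]] = {
  val rng = new Random(seed + partitionIndex)
  rows.map(_ => Seq.fill(10)(rng.nextInt(10)))
}

// Two partitions simulated locally so the sketch runs without a cluster.
val p0 = fillPartition(10L, 0, Iterator(1, 2, 3)).toList
val p1 = fillPartition(10L, 1, Iterator(4, 5, 6)).toList
val p0Again = fillPartition(10L, 0, Iterator(1, 2, 3)).toList

println(p0.mkString("\n"))
println(p1.mkString("\n"))
```

In spark-shell the same helper could be wired in with sc.parallelize(1 to 10, 2).mapPartitionsWithIndex { (idx, iter) => fillPartition(10L, idx, iter) }, since mapPartitionsWithIndex passes the partition index alongside the partition's iterator.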