Details
Description
This driver program, submitted with spark-submit:
TestUnion.scala
import org.apache.spark.sql.SQLContext

val sc = new org.apache.spark.SparkContext(conf)
val sqlCtx = new SQLContext(sc)
val rddForDay1 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=1")
val rddForDay2 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=2")
rddForDay1.cache()
rddForDay2.cache()
(rddForDay1 union rddForDay2).count()
generates these lines in the log, thanks to the .cache calls:
14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_1_0 in memory on dell-715-12.neura-local.com:39169 (size: 689.7 MB, free: 8.0 GB)
14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_0_0 in memory on dell-715-12.neura-local.com:39169 (size: 744.4 MB, free: 7.2 GB)
If I replace union with unionAll, these lines no longer appear in the log, which makes me think the RDDs are no longer being cached.
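One way to probe this, sketched below, is to inspect each RDD's storage level and lineage after forcing materialization. This is only a sketch, assuming the same Spark 1.0-era API as the snippet above (SQLContext.parquetFile, with SchemaRDD extending RDD[Row] so the plain RDD methods getStorageLevel and toDebugString are available) and the same cluster paths; it needs a running cluster and is not a standalone program.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// `conf` is the SparkConf from the original driver program.
val sc = new SparkContext(conf)
val sqlCtx = new SQLContext(sc)

val day1 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=1")
val day2 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=2")
day1.cache()
day2.cache()

// An action materializes the cache; afterwards the storage level
// should report in-memory storage if caching took effect.
day1.count()
println(day1.getStorageLevel.useMemory)

// toDebugString prints the lineage; a lineage that reads from the
// cache shows cached partitions. Comparing the two plans should
// reveal whether unionAll bypasses the cached blocks.
println((day1 union day2).toDebugString)
println((day1 unionAll day2).toDebugString)
```

If the unionAll lineage shows a fresh parquet scan rather than cached partitions, that would confirm the caching is being skipped rather than merely unlogged.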