Details
Description
This driver program, submitted with spark-submit:
TestUnion.scala
import org.apache.spark.sql.SQLContext

val sc = new org.apache.spark.SparkContext(conf)
val sqlCtx = new SQLContext(sc)
val rddForDay1 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=1")
val rddForDay2 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=2")
rddForDay1.cache()
rddForDay2.cache()
(rddForDay1 union rddForDay2).count()
generates these lines in the log, thanks to the .cache calls:
14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_1_0 in memory on dell-715-12.neura-local.com:39169 (size: 689.7 MB, free: 8.0 GB)
14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_0_0 in memory on dell-715-12.neura-local.com:39169 (size: 744.4 MB, free: 7.2 GB)
If I replace union with unionAll, these lines no longer appear in the log, which makes me think the RDDs are no longer being cached.
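One way to probe this, sketched below, is to inspect each RDD's storage level and lineage after forcing materialization. This is only a sketch, assuming the same Spark 1.0-era API as the snippet above (SQLContext.parquetFile, with SchemaRDD extending RDD[Row] so the plain RDD methods getStorageLevel and toDebugString are available) and the same cluster paths; it needs a running cluster and is not a standalone program.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

// `conf` is the SparkConf from the original driver program.
val sc = new SparkContext(conf)
val sqlCtx = new SQLContext(sc)

val day1 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=1")
val day2 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=2")
day1.cache()
day2.cache()

// An action materializes the cache; afterwards the storage level
// should report in-memory storage if caching took effect.
day1.count()
println(day1.getStorageLevel.useMemory)

// toDebugString prints the lineage; a lineage that reads from the
// cache shows cached partitions. Comparing the two plans should
// reveal whether unionAll bypasses the cached blocks.
println((day1 union day2).toDebugString)
println((day1 unionAll day2).toDebugString)
```

If the unionAll lineage shows a fresh parquet scan rather than cached partitions, that would confirm the caching is being skipped rather than merely unlogged.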