  1. Spark
  2. SPARK-2607

SchemaRDD unionAll prevents caching


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.5.0
    • Component/s: SQL
    • Environment: Linux vb2 3.13.0-30-generic #54-Ubuntu SMP Mon Jun 9 22:45:01 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

    Description

      This driver program submitted with spark-submit:

      TestUnion.scala
      import org.apache.spark.sql.SQLContext

      // conf is a SparkConf defined elsewhere in the driver
      val sc = new org.apache.spark.SparkContext(conf)
      val sqlCtx = new SQLContext(sc)
      val rddForDay1 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=1")
      val rddForDay2 = sqlCtx.parquetFile("hdfs://dell-715-09/user/hive/warehouse/mytable/uivr_year=2014/uivr_month=5/uivr_day=2")
      // Mark both inputs for caching, then force evaluation with count
      rddForDay1.cache()
      rddForDay2.cache()
      (rddForDay1 union rddForDay2).count()
      

      generates these lines in the log, thanks to the .cache calls:

      14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_1_0 in memory on dell-715-12.neura-local.com:39169 (size: 689.7 MB, free: 8.0 GB)
      14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_0_0 in memory on dell-715-12.neura-local.com:39169 (size: 744.4 MB, free: 7.2 GB)
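
      For larger logs, the check for these lines can be automated. The following is a minimal sketch in plain Scala (no Spark dependency) that extracts the RDD id, partition, host, and size from "Added rdd_X_Y in memory" lines. The names CachedBlock, CacheLogScan, and parseCachedBlock are hypothetical and exist only for this example, and the pattern assumes the log format shown above.

```scala
// Hypothetical record for one cached block reported by BlockManagerInfo.
case class CachedBlock(rddId: Int, partition: Int, host: String, size: String)

object CacheLogScan {
  // Assumes the line format shown in the log excerpt above, e.g.
  // "... INFO BlockManagerInfo: Added rdd_1_0 in memory on host:port (size: 689.7 MB, free: 8.0 GB)"
  private val Added =
    """.*BlockManagerInfo: Added rdd_(\d+)_(\d+) in memory on (\S+) \(size: ([^,]+), free: [^)]+\)""".r

  // Returns Some(block) for a matching line, None for any other log line.
  def parseCachedBlock(line: String): Option[CachedBlock] = line match {
    case Added(rdd, part, host, size) =>
      Some(CachedBlock(rdd.toInt, part.toInt, host, size))
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    val log = Seq(
      "14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_1_0 in memory on dell-715-12.neura-local.com:39169 (size: 689.7 MB, free: 8.0 GB)",
      "14/07/21 11:38:49 INFO BlockManagerInfo: Added rdd_0_0 in memory on dell-715-12.neura-local.com:39169 (size: 744.4 MB, free: 7.2 GB)"
    )
    // Print one record per cached block found in the log.
    log.flatMap(parseCachedBlock).foreach(println)
  }
}
```

      An empty result for a run means no block was reported as stored in memory, which is the symptom described in this issue.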
      

      If I replace union with unionAll, these lines no longer appear in the log, which makes me think the RDDs are no longer cached.
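
      The absence of log lines is indirect evidence. A more direct check is to ask the SparkContext what it has stored after the job runs. The sketch below assumes the Spark 1.0.x API (SchemaRDD, SQLContext.parquetFile, SparkContext.getPersistentRDDs, and the developer API SparkContext.getRDDStorageInfo). The object name CheckUnionAllCaching and the use of args(0) and args(1) as the two Parquet paths are placeholders for illustration, not part of the original report.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical driver: runs the unionAll variant, then reports cache state.
object CheckUnionAllCaching {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CheckUnionAllCaching"))
    val sqlCtx = new SQLContext(sc)

    // args(0) and args(1) are placeholder Parquet directories.
    val rddForDay1 = sqlCtx.parquetFile(args(0))
    val rddForDay2 = sqlCtx.parquetFile(args(1))
    rddForDay1.cache()
    rddForDay2.cache()

    // Same job as in the report, with unionAll instead of union.
    rddForDay1.unionAll(rddForDay2).count()

    // RDDs that were marked for persistence, whether or not any block was stored.
    sc.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"marked persistent: rdd_$id level=${rdd.getStorageLevel}")
    }

    // What the block managers actually report as stored.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"rdd_${info.id}: ${info.numCachedPartitions}/${info.numPartitions} " +
        s"partitions cached, ${info.memSize} bytes in memory")
    }

    sc.stop()
  }
}
```

      If the RDDs appear as marked persistent but no cached partitions are reported for them, that would be consistent with the suspicion that the unionAll job does not read from, or populate, the cache.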

          People

            Assignee: Michael Armbrust (marmbrus)
            Reporter: Thierry Herrmann (thierry.herrmann)
            Votes: 2
            Watchers: 6
