Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-6968

Make maniuplating an underlying RDD of a DataFrame easier

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Minor
    • Resolution: Later
    • 1.3.0
    • None
    • SQL
    • None
    • AWS EMR

    Description

      Use case: let's say you want to coalesce the RDD underpinning a DataFrame so that you get a certain number of partitions when you go to save it:

      RDDsAndDataFrames.scala
      val sc: SparkContext // An existing SparkContext.
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      
      val df = sqlContext.load("hdfs://examples/src/main/resources/people.avro", "avro")
      val coalescedRowRdd = df.rdd.coalesce(8)
      // Now the tricky part, you have to get the schema of the original dataframe:
      val originalSchema = df.schema
      val finallyCoalescedDF = sqlContext.createDataFrame(coalescedRowRdd , originalSchema )
      

      Basically, it would be nice to have an "attachRDD" method on DataFrames, that requires a RDD[Row], so long as it has the same schema, we should be good:

      SimplierRDDsAndDataFrames.scala
      val sc: SparkContext // An existing SparkContext.
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      
      val df = sqlContext.load("hdfs://examples/src/main/resources/people.avro", "avro")
      val finallyCoalescedDF = df.attachRDD(df.rdd.coalesce(8)
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            blue666man John Muller
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 336h
                336h
                Remaining:
                Remaining Estimate - 336h
                336h
                Logged:
                Time Spent - Not Specified
                Not Specified