[SPARK-38111] Retrieve a Spark dataframe as Arrow batches


Details

    • Type: Question
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.2.0
    • Fix Version/s: None
    • Component/s: Java API
    • Environment: Java 11, Spark 3

    Description

      Using the Java API, is there a way to efficiently retrieve a DataFrame as Arrow batches?

      I have a fairly large dataset on my cluster, so I cannot collect it with collectAsList, which downloads everything at once and saturates my JVM memory.

      Since Arrow is becoming a standard for transferring large datasets, and Spark already uses Arrow extensively, is there a way to transfer my Spark DataFrame as Arrow batches?

      That would let me process the data batch by batch and avoid saturating memory.
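
      For now, the closest fallback seems to be Dataset.toLocalIterator(), which pulls one partition at a time to the driver, but it is row-based rather than Arrow-based. A rough sketch (the helper name and df are just placeholders for the large DataFrame described above):

      import java.util.Iterator;
      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;

      // Placeholder helper; df stands for the large DataFrame mentioned above.
      void processRowByRow(Dataset<Row> df) {
          // toLocalIterator() materializes one partition at a time on the driver,
          // so only a single partition needs to fit in memory at once.
          Iterator<Row> rows = df.toLocalIterator();
          while (rows.hasNext()) {
              Row row = rows.next();
              // do some stuff with the row (rows, not Arrow batches)
          }
      }

      This avoids the all-at-once collect, but it still deserializes into Row objects instead of handing back Arrow record batches.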
       

      I am looking for an API like this (in Java):

       

      var stream = dataframe.collectAsArrowStream();
      while (stream.hasNextBatch()) {
          var batch = stream.getNextBatch();
          // do some stuff with the arrow batch
      }
      

      It would be even better if I could split the DataFrame into several streams, so I can download and process it in parallel.
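
      Another workaround for the parallel case would be to write the DataFrame out as several Parquet files and let each consumer fetch and process one file, at the cost of a round trip through storage instead of streaming Arrow batches directly. A sketch (the output path and partition count are placeholders):

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SaveMode;

      // Placeholder helper; df stands for the large DataFrame mentioned above.
      void exportAsParquet(Dataset<Row> df) {
          // Placeholder path and partition count; each resulting part file
          // can then be downloaded and processed by a separate consumer.
          df.repartition(8)
            .write()
            .mode(SaveMode.Overwrite)
            .parquet("/tmp/spark-38111-export");
      }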

      Attachments

        Activity

          People

            Assignee: Unassigned
            Reporter: Fabien (fabiche)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated: