Details
Description
Using the Java API, is there a way to efficiently retrieve a DataFrame as Arrow batches?
I have a fairly large dataset on my cluster, so I cannot collect it with collectAsList, which downloads everything at once and saturates my JVM memory.
Seeing that Arrow is becoming a standard for transferring large datasets, and that Spark already makes heavy use of Arrow, is there a way to transfer my Spark DataFrame as Arrow batches?
This would be ideal to process the data batch per batch and avoid saturating the memory.
I am looking for an API like this (in Java):

```java
var stream = dataframe.collectAsArrowStream();
while (stream.hasNextBatch()) {
    var batch = stream.getNextBatch();
    // do some stuff with the Arrow batch
}
```
It would be even better if I could split the DataFrame into several streams, so I can download and process them in parallel.
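For context, the `collectAsArrowStream` / `hasNextBatch` / `getNextBatch` names above are hypothetical; as far as I know, Spark's public Java API does not expose the collected result as Arrow batches (the Arrow conversion paths are internal). The closest public building block I'm aware of is `Dataset.toLocalIterator()`, which streams one partition at a time to the driver instead of materializing everything, though it yields rows rather than Arrow batches. A minimal sketch, assuming a `Dataset<Row>` obtained from an existing SparkSession:

```java
import java.util.Iterator;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class BatchedCollect {

    // Sketch: process a large Dataset without collecting it all at once.
    // toLocalIterator() fetches partitions lazily, so the driver only
    // holds one partition's worth of rows in memory at a time.
    static void processIncrementally(Dataset<Row> dataframe) {
        Iterator<Row> it = dataframe.toLocalIterator();
        while (it.hasNext()) {
            Row row = it.next();
            // do some stuff with each row (rows, not Arrow batches)
        }
    }
}
```

This only approximates the memory behavior I'm after; it does not give me actual Arrow record batches, which is why I'm asking whether such an API exists.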